We study the problem of multiclass classification with an extremely large number of classes (k), with the goal of obtaining train and test time complexity logarithmic in the number of classes. We develop top-down tree construction approaches for building logarithmic depth trees. On the theoretical front, we formulate a new objective function, which is optimized at each node of the tree and creates dynamic partitions of the data that are both pure (in terms of class labels) and balanced. We demonstrate that under favorable conditions, we can construct logarithmic depth trees that have leaves with low label entropy. However, the objective function at the nodes is challenging to optimize computationally. We address the empirical problem with a new online decision tree construction procedure. Experiments demonstrate that this online algorithm quickly achieves improvements in test error compared to more common logarithmic training time approaches, which makes it a plausible method for computationally constrained large-k applications.
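As a rough illustration of the kind of node-splitting objective described, one that rewards partitions that are simultaneously balanced and pure, the sketch below scores a candidate binary split (function and variable names are ours, not the paper's code):

```python
from collections import Counter

def split_score(labels, goes_right):
    """Score a candidate binary split of one node's examples. The score is
    high when each class is routed almost entirely to one side (purity)
    while the overall split rate stays near one half (balance).
    Illustrative form only."""
    n = len(labels)
    p_right = sum(goes_right) / n  # overall fraction of examples sent right
    score = 0.0
    for c, n_c in Counter(labels).items():
        # fraction of class-c examples sent right
        p_right_c = sum(1 for y, r in zip(labels, goes_right) if y == c and r) / n_c
        score += (n_c / n) * abs(p_right - p_right_c)
    return 2.0 * score

# a pure, fairly balanced split of a toy 3-class node scores high ...
labels = ["a", "a", "b", "b", "c", "c"]
print(split_score(labels, [True, True, False, False, True, True]))   # ~0.89
# ... while a split that cuts every class in half scores zero
print(split_score(labels, [True, False, True, False, True, False]))  # 0.0
```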
We consider the problem of estimating the conditional probability of a label in time O(log n), where n is the number of possible labels. We analyze a natural reduction of this problem to a set of binary regression problems organized in a tree structure, proving a regret bound that scales with the depth of the tree. Motivated by this analysis, we propose the first online algorithm which provably constructs a logarithmic depth tree on the set of labels to solve this problem. We test the algorithm empirically, showing that it works successfully on a dataset with roughly $10^6$ labels.
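A minimal sketch of how such a tree answers a conditional probability query in O(log n) time, assuming a trained binary regressor is stored at each internal node (the class layout and names below are hypothetical):

```python
class Node:
    def __init__(self, regressor=None, left=None, right=None, label=None):
        self.regressor = regressor   # callable x -> P(descend right | x)
        self.left, self.right = left, right
        self.label = label           # set only at leaves

def predict_proba(root, x):
    """Follow the most probable branch from root to leaf, multiplying the
    branch probabilities along the way; one regressor call per level gives
    O(log n) total work when the tree has logarithmic depth."""
    node, prob = root, 1.0
    while node.label is None:
        p_right = node.regressor(x)
        if p_right >= 0.5:
            node, prob = node.right, prob * p_right
        else:
            node, prob = node.left, prob * (1.0 - p_right)
    return node.label, prob  # estimated P(label | x) for the returned label
```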
In many settings, a decision maker wishes to learn a rule, or policy, that maps from observable characteristics of an individual to an action. Examples include selecting which offers, prices, advertisements, or emails to send to consumers, as well as the problem of determining which medication to prescribe to a patient. While a growing literature is devoted to this problem, most existing results focus on the case where the data comes from a randomized experiment and, moreover, there are only two possible actions, such as giving a patient a drug or not. In this paper, we study the offline multi-action policy learning problem with observational data, where the policy may need to respect budget constraints or belong to a restricted policy class such as decision trees. We build on the theory of efficient semiparametric inference to propose and implement a policy learning algorithm that achieves asymptotically minimax-optimal regret. To the best of our knowledge, this is the first result of this type in the multi-action setting, and it provides substantial performance improvements over existing learning algorithms. We then consider additional computational challenges that arise in implementing our method when the policy is restricted to take the form of a decision tree. We propose two different approaches, one using a mixed integer program formulation and the other using a tree-search based algorithm.
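For context on the semiparametric construction, one standard doubly robust score from this literature (notation ours; the paper's exact estimator may differ) combines an outcome model $\hat\mu_a$ with an inverse-propensity correction, and the learned policy maximizes the estimated value over the policy class $\Pi$:

$$
\hat\Gamma_{i,a} = \hat\mu_a(X_i) + \frac{\mathbf{1}\{A_i = a\}}{\hat e_a(X_i)}\big(Y_i - \hat\mu_a(X_i)\big),
\qquad
\hat\pi = \arg\max_{\pi \in \Pi} \frac{1}{n}\sum_{i=1}^{n} \hat\Gamma_{i,\pi(X_i)}.
$$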
Extreme multi-label classification (XMLC) is the problem of tagging an instance with a small subset of relevant labels chosen from an extremely large pool of possible labels. Large label spaces can be handled efficiently by organizing labels as a tree, as is done by the hierarchical softmax (HSM) approach commonly used for multi-class problems. In this paper, we investigate probabilistic label trees (PLTs), which have recently been devised for tackling XMLC problems. Using precision@k as the model evaluation metric, PLTs can be seen as a no-regret multi-label generalization of HSM. Importantly, we prove that the pick-one-label heuristic, a multi-label-to-multi-class reduction commonly used with HSM, is in general not consistent. We also show that our implementation of PLTs, called extremeText (XT), obtains better results than HSM with the pick-one-label heuristic and than XML-CNN, a deep network specifically designed for XMLC problems. Moreover, XT is competitive with many state-of-the-art approaches in terms of statistical performance, model size, and prediction time, which makes it amenable to deployment in online systems.
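The factorization underlying probabilistic label trees, standard in the PLT literature: the marginal probability of label $j$ is a product of binary node probabilities along the path from the root to $j$'s leaf,

$$
P(y_j = 1 \mid x) = \prod_{v \in \mathrm{Path}(j)} P\big(z_v = 1 \mid z_{\mathrm{pa}(v)} = 1,\, x\big),
$$

so estimating any single label's probability costs one binary classifier call per tree level.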
We consider the problem of (macro) F-measure maximization in the context of extreme multi-label classification (XMLC), i.e., multi-label classification with extremely large label spaces. We investigate several approaches based on recent results on the maximization of complex performance measures in binary classification. According to these results, the F-measure can be maximized by properly thresholding conditional class probability estimates. We show that a naïve adaptation of this approach can be very costly for XMLC and propose to solve the problem by classifiers that efficiently deliver sparse probability estimates (SPEs), that is, probability estimates restricted to the most probable labels. Empirical results provide evidence for the strong practical performance of this approach.
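A small sketch of the thresholding step with sparse probability estimates, assuming each instance arrives with estimates only for its few most probable labels (names and the default threshold are illustrative):

```python
def threshold_sparse(sparse_probs, thresholds, default_tau=0.5):
    """Predict label j for an instance iff its estimated P(y_j = 1 | x)
    exceeds the label's threshold. Labels missing from the sparse estimate
    are treated as below threshold, so work per instance is proportional
    to the number of retained labels, not the label-space size."""
    return [
        {j for j, p in probs.items() if p > thresholds.get(j, default_tau)}
        for probs in sparse_probs
    ]

# two instances over an (implicitly) huge label space
probs = [{7: 0.9, 123456: 0.4}, {7: 0.2, 42: 0.7}]
print(threshold_sparse(probs, {7: 0.5, 42: 0.6}))  # [{7}, {42}]
```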
Label tree classifiers are commonly used for efficient multi-class and multi-label classification. They represent a predictive model in the form of a tree-like hierarchy of (internal) classifiers, each of which is trained on a simpler (often binary) subproblem, and predictions are made by (greedily) following these classifiers' decisions from the root to a leaf of the tree. Unfortunately, this approach does not normally ensure consistency for different losses on the original prediction task, even if the internal classifiers are consistent for their subtask. In this paper, we thoroughly analyze a class of methods referred to as probabilistic classifier trees (PCTs). Thanks to training probabilistic classifiers at internal nodes of the hierarchy, these methods allow for searching the tree structure in a more sophisticated manner, thereby producing predictions of a less greedy nature. Our main result is a regret bound for 0/1 loss, which can easily be extended to ranking-based losses. In this regard, PCTs nicely complement a related approach called filter trees (FTs), and can indeed be seen as a natural alternative to them. We compare the two approaches both theoretically and empirically.
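To illustrate the less greedy search that probabilistic node classifiers enable, here is a best-first sketch (reusing the hypothetical Node layout from the earlier tree sketch): instead of committing to the locally best branch, it expands partial paths in order of probability mass, so the returned leaf maximizes the product of node probabilities.

```python
import heapq

def best_first_predict(root, x):
    """Uniform-cost search over a probabilistic classifier tree. The heap
    orders partial paths by (negated) probability so far; since path
    probability only shrinks as paths extend, the first leaf popped is
    the highest-probability complete path. Illustrative sketch."""
    heap = [(-1.0, 0, root)]  # (negated path probability, tiebreak, node)
    tiebreak = 1
    while heap:
        neg_p, _, node = heapq.heappop(heap)
        if node.label is not None:
            return node.label, -neg_p
        p_right = node.regressor(x)
        for child, p in ((node.right, p_right), (node.left, 1.0 - p_right)):
            heapq.heappush(heap, (neg_p * p, tiebreak, child))
            tiebreak += 1
```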
We present an algorithm, called the Offset Tree, for learning to make decisions in situations where the payoff of only one choice is observed, rather than all choices. The algorithm reduces this setting to binary classification, allowing one to reuse any existing, fully supervised binary classification algorithm in this partial information setting. We show that the Offset Tree is an optimal reduction to binary classification. In particular, it has regret at most (k − 1) times the regret of the binary classifier it uses (where k is the number of choices), and no reduction to binary classification can do better. This reduction is also computationally optimal, both at training and test time, requiring just O(log₂ k) work to train on an example or make a prediction. Experiments with the Offset Tree show that it generally performs better than several alternative approaches.
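As a hedged sketch of the reduction's key trick, the "offset" (the reward scaling and details below are illustrative, not lifted from the paper): each partial-feedback example becomes an importance-weighted binary example at a tournament node, so both high and low observed payoffs generate training signal.

```python
def offset_binary_example(reward, took_right_branch):
    """Convert one observed (action, reward) pair into a weighted binary
    example for a node classifier. Rewards are assumed scaled to [0, 1];
    offsetting by 1/2 lets rewards above the offset vote for the observed
    branch and rewards below it vote against it. Illustrative sketch."""
    weight = abs(reward - 0.5)           # importance weight
    if reward > 0.5:
        label = took_right_branch        # observed action looked good
    else:
        label = not took_right_branch    # observed action looked bad
    return label, weight
```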
One of the most fundamental problems in machine learning is comparing examples: given a pair of objects, we want to return a value reflecting their (dis)similarity. Similarity is often task-specific, and predefined distances can perform poorly, which has motivated work on metric learning. However, being able to learn a similarity-sensitive distance function also presupposes a rich, discriminative representation of the objects at hand. In this dissertation, we present contributions on both ends. In the first part of the thesis, assuming good representations of the data, we present a formulation for metric learning that, compared to prior work, more directly attempts to optimize k-NN accuracy. We also present extensions of this formulation to metric learning for k-NN regression, asymmetric similarity learning, and discriminative learning of Hamming distance. In the second part, we consider settings with a limited computational budget, i.e., where optimizing over a space of possible metrics is infeasible but access to a label-aware distance metric is still desired. We present a simple, computationally inexpensive, and well-motivated method for estimating such a metric relying only on gradient estimates, and discuss theoretical and experimental results. In the final part, we turn to the representation question, considering group-equivariant convolutional neural networks (GCNNs). Equivariance to symmetry transformations is explicitly encoded in GCNNs; classical CNNs are the simplest example. In particular, we present an SO(3)-equivariant neural network architecture for spherical data that operates entirely in Fourier space, and we also provide a formalism for the design of fully Fourier neural networks that are equivariant to the action of any continuous compact group.
The F-measure, which was originally introduced in information retrieval, is nowadays routinely used as a performance metric for problems such as binary classification, multi-label classification, and structured output prediction. Optimizing this measure is a statistically and computationally challenging problem, since no closed-form solution exists. Adopting a decision-theoretic perspective, this article provides a formal and experimental analysis of different approaches for maximizing the F-measure. We start with a Bayes-risk analysis of related loss functions, such as Hamming loss and subset zero-one loss, showing that optimizing such losses as a surrogate of the F-measure leads to a high worst-case regret. Subsequently, we perform a similar type of analysis for F-measure maximizing algorithms, showing that such algorithms are approximate, while relying on additional assumptions regarding the statistical distribution of the binary response variables. Furthermore, we present a new algorithm which is not only computationally efficient but also Bayes-optimal, regardless of the underlying distribution. To this end, the algorithm requires only a quadratic (with respect to the number of binary responses) number of parameters of the joint distribution. We illustrate the practical performance of all analyzed methods by means of experiments with multi-label classification problems.
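For reference, the F-measure in question, written for a ground-truth vector $\mathbf{y} \in \{0,1\}^m$ and prediction $\mathbf{h} \in \{0,1\}^m$ over $m$ binary responses:

$$
F(\mathbf{y}, \mathbf{h}) = \frac{2\sum_{i=1}^{m} y_i h_i}{\sum_{i=1}^{m} y_i + \sum_{i=1}^{m} h_i},
$$

the harmonic mean of precision and recall. The shared denominator couples the responses, so the expected F-measure does not decompose over labels; this coupling is what makes exact maximization depend on parameters of the joint distribution rather than on marginals alone.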
We present a family of pairwise tournaments reducing k-class classification to binary classification. These reductions are provably robust against a constant fraction of binary errors, simultaneously matching the best possible computation O(log k) and regret O(1). The construction also works for robustly selecting the best of k choices by tournament. We strengthen previous results by defeating a more powerful adversary than previously addressed while providing a new form of analysis. In this setting, the error correcting tournament has depth O(log k) while using O(k log k) comparators, both optimal up to a small constant.
We present a new method, the filter tree, for reducing k-class classification to binary classification. The filter tree is provably consistent in the sense that given an optimal binary classifier, the reduction yields an optimal multiclass classifier. (The commonly used tree approach is provably inconsistent.) We show that the filter tree is robust: it suffers multiclass regret at most log₂ k times the binary regret. The method can also be used for cost-sensitive multiclass classification, where each prediction may have a different associated loss. The resulting regret bound is superior to the guarantees provided by all previous methods.
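A minimal sketch of filter-tree-style prediction, assuming a trained binary classifier at each internal node of a single-elimination tournament over the k labels (node fields are hypothetical):

```python
def tournament_predict(root, x):
    """Each internal node's binary classifier picks between the winners of
    its two subtrees, so a prediction follows one root-to-leaf path and
    costs one classifier call per level: O(log k) calls in total."""
    node = root
    while node.label is None:
        node = node.right if node.classifier(x) else node.left
    return node.label
```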
We consider multi-class classification where the predictor has a hierarchical structure that allows for a very large number of labels both at train and test time. The predictive power of such models can heavily depend on the structure of the tree, and although past work showed how to learn the tree structure, it assumed that the feature vectors remain static. We provide a novel algorithm to simultaneously perform representation learning for the input data and learning of the hierarchical predictor. Our approach optimizes an objective function which favors balanced and easily-separable multi-way node partitions. We theoretically analyze this objective, showing that it gives rise to a boosting style property and a bound on classification error. We next show how to extend the algorithm to conditional density estimation. We empirically validate both variants of the algorithm on text classification and language modeling, respectively, and show that they compare favorably to common baselines in terms of accuracy and running time.
Multi-class classification becomes challenging at test time when the number of classes is very large and testing against every possible class can become computationally infeasible. This problem can be alleviated by imposing (or learning) a structure over the set of classes. We propose an algorithm for learning a tree-structure of classifiers which, by optimizing the overall tree loss, provides superior accuracy to existing tree labeling methods. We also propose a method that learns to embed labels in a low dimensional space that is faster than non-embedding approaches and has superior accuracy to existing embedding approaches. Finally we combine the two ideas resulting in the label embedding tree that outperforms alternative methods including One-vs-Rest while being orders of magnitude faster.
This paper is concerned with the class imbalance problem, which has been known to hinder the learning performance of classification algorithms. The problem occurs when there are significantly fewer observations of the target concept. Various real-world classification tasks, such as medical diagnosis, text categorization and fraud detection, suffer from this phenomenon. Standard machine learning algorithms yield better prediction performance with balanced datasets. In this paper, we demonstrate that active learning is capable of addressing the class imbalance problem by providing the learner with more balanced classes. We also propose an efficient way of selecting informative instances from a smaller pool of samples for active learning, one that does not necessitate a search through the entire dataset. The proposed method yields an efficient querying system and allows active learning to be applied to very large datasets. Our experimental results show that with an early stopping criterion, active learning achieves a fast solution with competitive prediction performance on imbalanced data classification.
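A small sketch of the pool-based selection idea (the pool size, the uncertainty measure, and the model interface are illustrative choices, not the paper's exact settings): each query scans only a small random pool instead of the full dataset.

```python
import random

def next_query(unlabeled, model, pool_size=100):
    """Pick the next instance to label: draw a small random pool and return
    its most uncertain member, here taken to be the one closest to the
    decision boundary of a model exposing a signed distance. Cost per
    query is O(pool_size), independent of the dataset size."""
    pool = random.sample(unlabeled, min(pool_size, len(unlabeled)))
    return min(pool, key=lambda x: abs(model.decision_function(x)))
```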
Inspired by the success of AlphaGo Zero (AGZ), which utilizes Monte Carlo Tree Search (MCTS) with neural network supervised learning to learn the optimal policy and value function, in this work we focus on formally establishing that such an approach indeed finds the asymptotically optimal policy, as well as establishing non-asymptotic guarantees along the way. We focus on infinite-horizon discounted Markov decision processes to establish the results. To start, this requires establishing a property of MCTS claimed in the literature: for any given query state, MCTS provides an approximate value function for that state given enough simulated MDP steps. We provide a non-asymptotic analysis establishing this property by analyzing a non-stationary multi-armed bandit setup. Our proofs suggest that MCTS needs to utilize a polynomial rather than logarithmic "upper confidence bound" to establish its desired performance; interestingly, AGZ chooses such a polynomial bound. Using this as a building block, combined with nearest-neighbor supervised learning, we argue that MCTS acts as a "policy improvement" operator: it has a natural "bootstrapping" property that iteratively improves the value function approximation for all states, thanks to the combination with supervised learning, despite evaluating at only finitely many states. In effect, we establish that to learn an $\varepsilon$-approximation of the value function with respect to the $\ell_\infty$ norm, MCTS combined with nearest-neighbor supervised learning requires a number of samples scaling as $\widetilde{O}\big(\varepsilon^{-(d+4)}\big)$, where $d$ is the dimension of the state space. This is nearly optimal due to a minimax lower bound of $\widetilde{\Omega}\big(\varepsilon^{-(d+2)}\big)$.
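To make the logarithmic-versus-polynomial contrast concrete: classical UCB selects arms by $\hat{\mu}_a + \sqrt{2\ln t / n_a}$, whereas the analysis here calls for a bonus that is polynomial in the counts, schematically of the form below (the exponents $\alpha, \beta$ are placeholders for exposition, not the paper's constants):

$$
a_t = \arg\max_a \; \hat{\mu}_a + c\,\frac{t^{\alpha}}{n_a^{\beta}}.
$$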
Sampling-based algorithms are viewed as practical solutions for high-dimensional motion planning. Recent progress has taken advantage of random geometric graph theory to show how asymptotic optimality can also be achieved with these methods. Achieving this desirable property for systems with dynamics requires solving a two-point boundary value problem (BVP) in the state space of the underlying dynamical system. It is difficult, however, if not impractical, to generate a BVP solver for a variety of important dynamical models of robots or physically simulated ones. Thus, an open challenge was whether it was even possible to achieve optimality guarantees when planning for systems without access to a BVP solver. This work resolves the above question and describes how to achieve asymptotic optimality for kinodynamic planning using incremental sampling-based planners by introducing a new rigorous framework. Two new methods, STABLE_SPARSE_RRT (SST) and SST*, result from this analysis, which are asymptotically near-optimal and optimal, respectively. The techniques are shown to converge fast to high-quality paths, while they maintain only a sparse set of samples, which makes them computationally efficient. The good performance of the planners is confirmed by experimental results using dynamical systems benchmarks, as well as physically simulated robots.
Hashing based approximate nearest neighbor (ANN) search in huge databases has become popular owing to its computational and memory efficiency. The popular hashing methods, e.g., Locality Sensitive Hashing and Spectral Hashing, construct hash functions based on random or principal projections. The resulting hashes are either not very accurate or inefficient. Moreover, these methods are designed for a given metric similarity. On the contrary, semantic similarity is usually given in terms of pairwise labels of samples. There exist supervised hashing methods that can handle such semantic similarity, but they are prone to overfitting when the labeled data is scarce or noisy. In this work, we propose a semi-supervised hashing (SSH) framework that minimizes empirical error over the labeled set and an information theoretic regularizer over both the labeled and unlabeled sets. Based on this framework, we present three different semi-supervised hashing methods, including orthogonal hashing, non-orthogonal hashing, and sequential hashing. In particular, the sequential hashing method generates robust codes in which each hash function is designed to correct the errors made by the previous ones. We further show that the sequential learning paradigm can be extended to unsupervised domains where no labeled pairs are available. Extensive experiments on four large datasets (up to 80 million samples) demonstrate the superior performance of the proposed SSH methods over state-of-the-art supervised and unsupervised hashing techniques.
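A minimal sketch of the projection-based hashing form such frameworks build on: binary codes from learned (or random) linear projections, followed by a Hamming distance lookup. The data, dimensions, and code length below are toy assumptions.

```python
import numpy as np

def linear_hash(W, X):
    """Map data to binary codes with linear projections: one bit per
    projection, h(x) = sign(W^T x), stored as 0/1."""
    return (X @ W > 0).astype(np.uint8)

# usage: 8-bit codes for toy data, then a Hamming distance lookup
rng = np.random.default_rng(0)
X, W = rng.normal(size=(100, 16)), rng.normal(size=(16, 8))
codes = linear_hash(W, X)
query = codes[0]
hamming = (codes != query).sum(axis=1)   # distances to all stored items
print(hamming.argsort()[:5])             # approximate nearest neighbors
```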
In a multi-armed bandit problem, an online algorithm chooses from a set of strategies in a sequence of trials so as to maximize the total payoff of the chosen strategies. While the performance of bandit algorithms with a small finite strategy set is quite well understood, bandit problems with large strategy sets are still a topic of very active investigation, motivated by practical applications such as online auctions and web advertisement. The goal of such research is to identify broad and natural classes of strategy sets and payoff functions which enable the design of efficient solutions. In this work we study a very general setting for the multi-armed bandit problem in which the strategies form a metric space, and the payoff function satisfies a Lipschitz condition with respect to the metric. We refer to this problem as the Lipschitz MAB problem. We present a solution for the multi-armed bandit problem in this setting. That is, for every metric space we define an isometry invariant which bounds from below the performance of Lipschitz MAB algorithms for this metric space, and we present an algorithm which comes arbitrarily close to meeting this bound. Furthermore, our technique gives even better results for benign payoff functions. We also address the full-feedback ("best expert") version of the problem, where after every round the payoffs from all arms are revealed. Compared to the conference publications, the manuscript contains full proofs and a significantly revised presentation. In particular, it develops new terminology and modifies the proof outlines to unify the technical exposition of the two papers. The manuscript also features an updated discussion of the follow-up work and open questions. All results on the zooming algorithm and the max-min-covering dimension are from Kleinberg et al. (2008c); all results on regret dichotomies and on Lipschitz experts are from Kleinberg and Slivkins (2010).
Most research on nearest neighbor algorithms in the literature has been focused on the Euclidean case. In many practical search problems, however, the underlying metric is non-Euclidean. Nearest neighbor algorithms for general metric spaces are quite weak, which motivates a search for other classes of metric spaces that can be tractably searched. In this paper, we develop an efficient dynamic data structure for nearest neighbor queries in growth-constrained metrics. These metrics satisfy the property that for any point q and number r, the ratio between the numbers of points in balls of radius 2r and r is bounded by a constant. Spaces of this kind may occur in networking applications, such as the Internet or peer-to-peer networks, and vector quantization applications, where feature vectors fall into low-dimensional manifolds within high-dimensional vector spaces.
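The growth constraint stated above, written out: with $B(q, r)$ denoting the set of points within distance $r$ of $q$, the metric must satisfy, for some constant $c$ and all points $q$ and radii $r > 0$,

$$
|B(q, 2r)| \le c \cdot |B(q, r)|.
$$

Doubling a ball's radius can multiply the number of points it contains by at most a constant, which rules out the pathological concentrations that make general metric spaces hard to search.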