We study the problem of multiclass classification with an extremely large number of classes (k), with the goal of obtaining train- and test-time complexity logarithmic in the number of classes. We develop top-down approaches for constructing logarithmic-depth trees. On the theoretical front, we formulate a new objective function, which is optimized at each node of the tree and creates dynamic partitions of the data that are both pure (in terms of class labels) and balanced. We demonstrate that under favorable conditions, we can construct logarithmic-depth trees that have leaves with low label entropy. However, the objective function at the nodes is challenging to optimize computationally. We address the empirical problem with a new online decision tree construction procedure. Experiments demonstrate that this online algorithm quickly achieves improvement in test error compared to more common logarithmic-training-time approaches, which makes it a plausible method for computationally constrained large-k applications.
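As a rough illustration of the balanced-and-pure node objective described above, the sketch below scores a candidate binary split from per-class routing probabilities and class priors. The functional form, the quantities, and the toy numbers are assumptions chosen for illustration; this is not the paper's exact objective or estimator.

```python
import numpy as np

def node_objective(left_prob_per_class: np.ndarray, class_prior: np.ndarray) -> float:
    """Score a candidate binary split at a tree node.

    left_prob_per_class[y] -- assumed probability that class-y examples are routed left.
    class_prior[y]         -- empirical frequency of class y at this node.

    The score is largest when the split is balanced (roughly half of the total
    mass goes left) and pure (each class goes almost entirely to one side),
    which is the kind of partition the objective above is meant to favor.
    """
    overall_left = float(np.dot(class_prior, left_prob_per_class))
    return 2.0 * float(np.dot(class_prior, np.abs(left_prob_per_class - overall_left)))

# Toy check: a perfectly pure and balanced split of four equally likely classes.
prior = np.full(4, 0.25)
pure_balanced = np.array([1.0, 1.0, 0.0, 0.0])
print(node_objective(pure_balanced, prior))  # 1.0, the maximum possible value
```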
We consider the problem of estimating the conditional probability of a label in time O(log n), where n is the number of possible labels. We analyze a natural reduction of this problem to a set of binary regression problems organized in a tree structure, proving a regret bound that scales with the depth of the tree. Motivated by this analysis, we propose the first online algorithm which provably constructs a logarithmic-depth tree on the set of labels to solve this problem. We test the algorithm empirically, showing that it works successfully on a dataset with roughly 10^6 labels.
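To make the reduction concrete, here is a minimal sketch of how such a tree yields an O(log n) estimate: the conditional probability of a label is the product of the internal regressors' outputs along the root-to-leaf path for that label. The tiny hand-built tree and the constant node regressors are illustrative assumptions, not the paper's construction.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional, Set

@dataclass
class Node:
    p_left: Optional[Callable[[object], float]] = None  # regressor: P(label in left subtree | x)
    left_labels: Set[int] = field(default_factory=set)  # in practice a stored bit per label, not a set
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    label: Optional[int] = None                          # set only at leaves

def estimate(node: Node, x, y: int) -> float:
    """P(y | x) as the product of node probabilities along the path to label y."""
    prob = 1.0
    while node.label is None:
        p = node.p_left(x)
        if y in node.left_labels:
            prob, node = prob * p, node.left
        else:
            prob, node = prob * (1.0 - p), node.right
    return prob

# Toy tree over labels {0, 1, 2, 3} with made-up node regressors.
leaf = lambda y: Node(label=y)
tree = Node(p_left=lambda x: 0.7, left_labels={0, 1},
            left=Node(p_left=lambda x: 0.6, left_labels={0}, left=leaf(0), right=leaf(1)),
            right=Node(p_left=lambda x: 0.5, left_labels={2}, left=leaf(2), right=leaf(3)))
print(estimate(tree, x=None, y=1))  # 0.7 * 0.4 ≈ 0.28
```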
In many settings, a decision maker wishes to learn a rule, or policy, that maps from observable characteristics of an individual to an action. Examples include selecting which offers, prices, advertisements, or emails to send to consumers, as well as deciding which medication to prescribe to a patient. While there is a growing literature devoted to this problem, most existing results focus on the case where the data come from a randomized experiment and, moreover, there are only two possible actions, such as giving a patient a drug or not. In this paper, we study the offline multi-action policy learning problem with observational data, where the policy may be required to satisfy budget constraints or to belong to a restricted policy class such as decision trees. Building on the theory of efficient semiparametric inference, we propose and implement a policy learning algorithm that achieves asymptotically minimax-optimal regret. To our knowledge, this is the first result of this kind in the multi-action setting, and it offers a substantial performance improvement over existing learning algorithms. We then consider the additional computational challenges that arise when implementing our method with policies restricted to take the form of decision trees. We propose two different approaches, one based on a mixed-integer program formulation and one based on a tree-search algorithm.
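For a concrete picture of the kind of estimator the semiparametric-efficiency machinery produces, the sketch below scores a candidate multi-action policy with doubly robust (AIPW-style) value estimates. The fitted outcome model mu_hat and propensity model e_hat are assumed given, and all names are illustrative; this is one standard construction from that literature, not the paper's full algorithm.

```python
import numpy as np

def aipw_scores(actions, outcomes, mu_hat, e_hat):
    """Gamma[i, a]: doubly robust estimate of unit i's outcome had it received action a.

    actions  -- (n,)   observed actions (integers in 0..k-1)
    outcomes -- (n,)   observed outcomes
    mu_hat   -- (n, k) estimated E[Y | X_i, action a]
    e_hat    -- (n, k) estimated propensities P(A = a | X_i)
    """
    gamma = mu_hat.copy()
    rows = np.arange(len(actions))
    gamma[rows, actions] += (outcomes - mu_hat[rows, actions]) / e_hat[rows, actions]
    return gamma

def policy_value(policy_actions, gamma):
    """Average doubly robust score of the actions a candidate policy would take."""
    return gamma[np.arange(len(policy_actions)), policy_actions].mean()

# Toy usage with made-up nuisance estimates for n = 3 units and k = 2 actions.
gamma = aipw_scores(actions=np.array([0, 1, 0]),
                    outcomes=np.array([1.0, 0.0, 2.0]),
                    mu_hat=np.array([[0.5, 1.0], [0.5, 0.5], [1.5, 0.0]]),
                    e_hat=np.array([[0.5, 0.5], [0.4, 0.6], [0.7, 0.3]]))
print(policy_value(np.array([1, 1, 0]), gamma))
```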
We consider the problem of (macro) F-measure maximization in the context of extreme multi-label classification (XMLC), i.e., multi-label classification with extremely large label spaces. We investigate several approaches based on recent results on the maximization of complex performance measures in binary classification. According to these results, the F-measure can be maximized by properly thresholding conditional class probability estimates. We show that a naïve adaptation of this approach can be very costly for XMLC and propose to solve the problem by classifiers that efficiently deliver sparse probability estimates (SPEs), that is, probability estimates restricted to the most probable labels. Empirical results provide evidence for the strong practical performance of this approach.
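As a small illustration of the thresholding idea under the sparse-estimate constraint, the sketch below picks, for a single label, the probability threshold that maximizes F1 on held-out data, treating labels missing from an instance's sparse estimate as having probability zero. The validation-sweep strategy and all names are assumptions for illustration.

```python
import numpy as np

def best_threshold(probs: np.ndarray, truth: np.ndarray) -> float:
    """Pick the probability threshold that maximizes F1 for one label."""
    best_t, best_f1 = 0.5, -1.0
    for t in np.unique(probs):
        pred = probs >= t
        tp = np.sum(pred & truth)
        f1 = 2 * tp / (pred.sum() + truth.sum()) if (pred.sum() + truth.sum()) else 0.0
        if f1 > best_f1:
            best_t, best_f1 = float(t), f1
    return best_t

# Sparse estimates: each instance keeps probabilities only for its top labels.
sparse_estimates = [{3: 0.9, 7: 0.2}, {3: 0.6}, {7: 0.8, 3: 0.1}]
truth_for_label_3 = np.array([1, 1, 0])
probs_for_label_3 = np.array([d.get(3, 0.0) for d in sparse_estimates])
print(best_threshold(probs_for_label_3, truth_for_label_3))  # 0.6 on this toy data
```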
Extreme multi-label classification (XMLC) is the problem of tagging an instance with a small subset of relevant labels chosen from an extremely large pool of possible labels. Large label spaces can be handled efficiently by organizing the labels as a tree, as in the hierarchical softmax (HSM) approach commonly used for multi-class problems. In this paper, we investigate probabilistic label trees (PLTs), which have recently been devised for solving XMLC problems. Using precision@k as the model evaluation metric, PLTs can be seen as a no-regret multi-label generalization of HSM. Importantly, we prove that the pick-one-label heuristic, a multi-label to multi-class reduction technique commonly used with HSM, is not consistent in general. We also show that our implementation of PLTs, referred to as extremeText (XT), obtains better results than HSM with the pick-one-label heuristic and than XML-CNN, a deep network specifically designed for XMLC problems. Moreover, XT is competitive with many state-of-the-art approaches in terms of statistical performance, model size, and prediction time, which makes it suitable for deployment in online systems.
Label tree classifiers are commonly used for efficient multi-class and multi-label classification. They represent a predictive model in the form of a tree-like hierarchy of (internal) classifiers, each of which is trained on a simpler (often binary) subproblem, and predictions are made by (greedily) following these classifiers' decisions from the root to a leaf of the tree. Unfortunately, this approach does not in general ensure consistency for different losses on the original prediction task, even if the internal classifiers are consistent for their subtasks. In this paper, we thoroughly analyze a class of methods referred to as probabilistic classifier trees (PCTs). Thanks to training probabilistic classifiers at internal nodes of the hierarchy, these methods allow for searching the tree structure in a more sophisticated manner, thereby producing predictions of a less greedy nature. Our main result is a regret bound for 0/1 loss, which can easily be extended to ranking-based losses. In this regard, PCTs nicely complement a related approach called filter trees (FTs), and can indeed be seen as a natural alternative thereof. We compare the two approaches both theoretically and empirically.
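The sketch below illustrates the kind of less-greedy inference that probabilistic node classifiers permit: since every leaf's probability is a product of node probabilities, the most probable leaf can be found by best-first (uniform-cost) search rather than by greedily following the locally best child. The tiny tree and its probabilities are illustrative assumptions, not a trained model.

```python
import heapq

def best_leaf(root, x):
    """Return (label, probability) of the most probable leaf for input x."""
    heap = [(-1.0, 0, root)]  # (negative path probability, tie-breaker, node)
    counter = 1
    while heap:
        neg_p, _, node = heapq.heappop(heap)
        if node["label"] is not None:
            return node["label"], -neg_p       # first leaf popped is optimal
        p_left = node["p_left"](x)
        for child, p in ((node["left"], p_left), (node["right"], 1.0 - p_left)):
            heapq.heappush(heap, (neg_p * p, counter, child))
            counter += 1

leaf = lambda y: {"label": y}
node = lambda p, l, r: {"label": None, "p_left": lambda x, p=p: p, "left": l, "right": r}
tree = node(0.6, node(0.5, leaf(0), leaf(1)), node(0.99, leaf(2), leaf(3)))
print(best_leaf(tree, x=None))  # (2, 0.396): a greedy walk would stay in the left subtree
```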
We present an algorithm, called the Offset Tree, for learning to make decisions in situations where the payoff of only one choice is observed, rather than all choices. The algorithm reduces this setting to binary classification, allowing one to reuse any existing, fully supervised binary classification algorithm in this partial-information setting. We show that the Offset Tree is an optimal reduction to binary classification. In particular, it has regret at most (k − 1) times the regret of the binary classifier it uses (where k is the number of choices), and no reduction to binary classification can do better. This reduction is also computationally optimal, both at training and test time, requiring just O(log_2 k) work to train on an example or make a prediction. Experiments with the Offset Tree show that it generally performs better than several alternative approaches.
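To give a flavor of the reduction, here is a schematic reading of how a single logged interaction can become a weighted binary example at one internal node: the observed reward is offset by 1/2 and divided by the probability of the logged choice, and low-reward observations become votes for the opposite subtree. The names, the bookkeeping, and the exact form are assumptions for illustration, not the paper's precise construction.

```python
def node_example(observed_action, reward, logging_prob, actions):
    """Return (binary_label, importance_weight) for one node, or None if the
    observed action does not fall under this node. Rewards are assumed in [0, 1]."""
    if observed_action not in actions["left"] | actions["right"]:
        return None
    observed_side = "left" if observed_action in actions["left"] else "right"
    if reward >= 0.5:
        label, weight = observed_side, (reward - 0.5) / logging_prob
    else:  # low reward: vote for the OTHER side, with the offset weight
        label = "right" if observed_side == "left" else "left"
        weight = (0.5 - reward) / logging_prob
    return label, weight

print(node_example(observed_action=2, reward=0.9, logging_prob=0.25,
                   actions={"left": {1, 2}, "right": {3, 4}}))
# ('left', 1.6): a high reward for action 2 becomes a strongly weighted vote for the left subtree
```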
One of the most fundamental problems in machine learning is comparing examples: given a pair of objects, we want to return a value indicating their (dis)similarity. Similarity is often task-specific, and predefined distances may perform poorly, which has motivated work on metric learning. However, being able to learn a similarity-sensitive distance function also presupposes a rich, discriminative representation of the objects at hand. In this thesis, we make contributions on both ends. In the first part, assuming the data already has a good representation, we propose a formulation for metric learning that, compared with prior work, more directly attempts to optimize k-NN accuracy. We also present extensions of this formulation for metric learning for k-NN regression, asymmetric similarity learning, and discriminative learning of Hamming distances. In the second part, we consider the situation where we operate under a limited computational budget, i.e., optimizing over the space of possible metrics is infeasible, yet access to a label-aware distance metric is still desired. We present a simple, computationally inexpensive approach for estimating such a metric that relies only on gradient estimates, and we discuss its theoretical motivation and experimental results. In the final part, we turn to the representation problem and consider group-equivariant convolutional neural networks (GCNNs). Equivariance to symmetry transformations is explicitly encoded in GCNNs, of which classical CNNs are the simplest example. In particular, we propose an SO(3)-equivariant neural network architecture for spherical data that operates entirely in Fourier space, while also providing a formalism for the design of fully Fourier neural networks that are equivariant to the action of any continuous compact group.
We present a family of pairwise tournaments reducing k-class classification to binary classification. These reductions are provably robust against a constant fraction of binary errors, simultaneously matching the best possible computation O(log k) and regret O(1). The construction also works for robustly selecting the best of k choices by tournament. We strengthen previous results by defeating a more powerful adversary than previously addressed while providing a new form of analysis. In this setting, the error-correcting tournament has depth O(log k) while using O(k log k) comparators, both optimal up to a small constant.
The F-measure, which was originally introduced in information retrieval, is nowadays routinely used as a performance metric for problems such as binary classification, multi-label classification, and structured output prediction. Optimizing this measure is a statistically and computationally challenging problem, since no closed-form solution exists. Adopting a decision-theoretic perspective, this article provides a formal and experimental analysis of different approaches for maximizing the F-measure. We start with a Bayes-risk analysis of related loss functions, such as Hamming loss and subset zero-one loss, showing that optimizing such losses as a surrogate of the F-measure leads to a high worst-case regret. Subsequently, we perform a similar type of analysis for F-measure maximizing algorithms, showing that such algorithms are approximate and rely on additional assumptions regarding the statistical distribution of the binary response variables. Furthermore, we present a new algorithm which is not only computationally efficient but also Bayes-optimal, regardless of the underlying distribution. To this end, the algorithm requires only a quadratic (with respect to the number of binary responses) number of parameters of the joint distribution. We illustrate the practical performance of all analyzed methods by means of experiments with multi-label classification problems.
We present a new method, the filter tree, for reducing k-class classification to binary classification. The filter tree is provably consistent in the sense that given an optimal binary classifier, the reduction yields an optimal multiclass classifier. (The commonly used tree approach is provably inconsistent.) We show that the filter tree is robust: it suffers multiclass regret at most log_2 k times the binary regret. The method can also be used for cost-sensitive multiclass classification, where each prediction may have a different associated loss. The resulting regret bound is superior to the guarantees provided by all previous methods.
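At test time the filter tree needs only about log_2 k binary evaluations: labels sit at the leaves of a balanced binary tree and prediction follows the internal classifiers' decisions from the root down to a leaf. The toy tree and the stand-in decision rules below are assumptions for illustration, not trained classifiers.

```python
def predict(node, x):
    """Follow internal classifier decisions from the root to a leaf label."""
    while node["label"] is None:
        node = node["left"] if node["clf"](x) else node["right"]
    return node["label"]

leaf = lambda y: {"label": y, "clf": None, "left": None, "right": None}
internal = lambda clf, l, r: {"label": None, "clf": clf, "left": l, "right": r}

# Toy tree over 4 classes with made-up decision rules on a 1-d input x.
tree = internal(lambda x: x < 0.5,
                internal(lambda x: x < 0.25, leaf(0), leaf(1)),
                internal(lambda x: x < 0.75, leaf(2), leaf(3)))
print(predict(tree, 0.6))  # 2, reached with two binary evaluations
```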
We consider multi-class classification where the predictor has a hierarchical structure that allows for a very large number of labels both at train and test time. The predictive power of such models can heavily depend on the structure of the tree, and although past work showed how to learn the tree structure, it assumed that the feature vectors remained static. We provide a novel algorithm to simultaneously perform representation learning for the input data and learning of the hierarchical predictor. Our approach optimizes an objective function which favors balanced and easily separable multi-way node partitions. We theoretically analyze this objective, showing that it gives rise to a boosting-style property and a bound on classification error. We next show how to extend the algorithm to conditional density estimation. We empirically validate both variants of the algorithm on text classification and language modeling, respectively, and show that they compare favorably to common baselines in terms of accuracy and running time.
Multi-class classification becomes challenging at test time when the number of classes is very large and testing against every possible class can become computationally infeasible. This problem can be alleviated by imposing (or learning) a structure over the set of classes. We propose an algorithm for learning a tree structure of classifiers which, by optimizing the overall tree loss, provides superior accuracy to existing tree labeling methods. We also propose a method that learns to embed labels in a low-dimensional space; it is faster than non-embedding approaches and has superior accuracy to existing embedding approaches. Finally, we combine the two ideas, resulting in the label embedding tree, which outperforms alternative methods, including One-vs-Rest, while being orders of magnitude faster.
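A bare-bones sketch of the label-embedding idea follows: labels are embedded in a low-dimensional space, the input is projected into the same space, and prediction returns the best-aligned label embedding, which is far cheaper than evaluating one classifier per class. The random matrices stand in for learned parameters and are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, e = 100, 10_000, 32          # input dim, number of labels, embedding dim
W = rng.normal(size=(e, d))        # stand-in for the learned input projection
V = rng.normal(size=(k, e))        # stand-in for the learned label embeddings

def predict(x: np.ndarray) -> int:
    z = W @ x                      # project the input into the label-embedding space
    return int(np.argmax(V @ z))   # best-aligned label embedding

print(predict(rng.normal(size=d)))
```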
We describe MELEE, a meta-learning algorithm for learning good exploration policies in the interactive contextual bandit setting. Here, an algorithm must take actions based on contexts and learn only from the reward signal of the actions it takes, giving rise to an exploration/exploitation trade-off. MELEE addresses this trade-off by learning a good exploration policy on offline tasks based on synthetic data, on which it can simulate the contextual bandit setting. Based on these simulations, MELEE uses an imitation learning strategy to learn a good exploration policy that can then be applied to true contextual bandit tasks at test time. We compare MELEE to seven strong baseline contextual bandit algorithms on a set of three hundred real-world datasets, and it outperforms the alternatives in most cases, especially when the differences in rewards are large. Finally, we demonstrate the importance of having a rich feature representation for learning how to explore.
This paper is concerned with the class imbalance problem, which has been known to hinder the learning performance of classification algorithms. The problem occurs when there are significantly fewer observations of the target concept. Various real-world classification tasks, such as medical diagnosis, text categorization, and fraud detection, suffer from this phenomenon. Standard machine learning algorithms yield better prediction performance with balanced datasets. In this paper, we demonstrate that active learning is capable of solving the class imbalance problem by providing the learner with more balanced classes. We also propose an efficient way of selecting informative instances from a smaller pool of samples for active learning, which does not necessitate a search through the entire dataset. The proposed method yields an efficient querying system and allows active learning to be applied to very large datasets. Our experimental results show that with an early stopping criterion, active learning achieves a fast solution with competitive prediction performance in imbalanced data classification.
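The querying shortcut described above can be sketched as follows: rather than scanning the entire dataset, draw a small random pool and query the instance the current model is least certain about (closest to the decision boundary). The pool size, the margin-based criterion, and the toy data are illustrative assumptions; the scikit-learn calls are standard.

```python
import numpy as np
from sklearn.svm import LinearSVC

def query_index(model, X, pool_size=100, seed=0):
    """Return the index of the most uncertain instance within a small random pool."""
    rng = np.random.default_rng(seed)
    pool = rng.choice(len(X), size=min(pool_size, len(X)), replace=False)
    margins = np.abs(model.decision_function(X[pool]))  # distance-like score to the boundary
    return int(pool[np.argmin(margins)])                # closest to the hyperplane wins

# Toy imbalanced task: roughly 16% positives.
rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 5))
y = (X[:, 0] > 1.0).astype(int)
model = LinearSVC().fit(X[:200], y[:200])
print(query_index(model, X[200:]))  # index (within X[200:]) of the next instance to label
```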
Hashing-based approximate nearest neighbor (ANN) search in huge databases has become popular owing to its computational and memory efficiency. Popular hashing methods, e.g., Locality Sensitive Hashing and Spectral Hashing, construct hash functions based on random or principal projections. The resulting hashes are either not very accurate or inefficient. Moreover, these methods are designed for a given metric similarity. On the contrary, semantic similarity is usually given in terms of pairwise labels of samples. There exist supervised hashing methods that can handle such semantic similarity, but they are prone to overfitting when the labeled data is small or noisy. In this work, we propose a semi-supervised hashing (SSH) framework that minimizes empirical error over the labeled set and an information-theoretic regularizer over both the labeled and unlabeled sets. Based on this framework, we present three different semi-supervised hashing methods, including orthogonal hashing, non-orthogonal hashing, and sequential hashing. In particular, the sequential hashing method generates robust codes in which each hash function is designed to correct the errors made by the previous ones. We further show that the sequential learning paradigm can be extended to unsupervised domains where no labeled pairs are available. Extensive experiments on four large datasets (up to 80 million samples) demonstrate the superior performance of the proposed SSH methods over state-of-the-art supervised and unsupervised hashing techniques.
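For orientation, the sketch below shows the generic projection-based hashing pipeline such frameworks build on: codes are signs of linear projections and candidates are ranked by Hamming distance. In SSH the projection matrix would be learned from the labeled-pair error plus the regularizer; the random matrix here is a stand-in assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
d, bits = 64, 32
W = rng.normal(size=(bits, d))                # stand-in for learned projections

def hash_code(X: np.ndarray) -> np.ndarray:
    """One bit per projection: the sign of W x."""
    return (X @ W.T > 0).astype(np.uint8)

def hamming_rank(query: np.ndarray, database_codes: np.ndarray) -> np.ndarray:
    """Indices of database items ordered by Hamming distance to the query's code."""
    dists = (hash_code(query[None, :]) ^ database_codes).sum(axis=1)
    return np.argsort(dists)

db = rng.normal(size=(1000, d))
codes = hash_code(db)
print(hamming_rank(db[0], codes)[:5])         # db[0] is at distance 0 from itself, so it leads the ranking
```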
This paper presents the MAXQ approach to hierarchical reinforcement learning, which is based on decomposing the target Markov decision process (MDP) into a hierarchy of smaller MDPs and decomposing the value function of the target MDP into an additive combination of the value functions of the smaller MDPs. The paper defines the MAXQ hierarchy, proves formal results on its representational power, and establishes five conditions for the safe use of state abstractions. The paper presents an online model-free learning algorithm, MAXQ-Q, and proves that it converges with probability 1 to a kind of locally optimal policy known as a recursively optimal policy, even in the presence of the five kinds of state abstraction. The paper evaluates the MAXQ representation and MAXQ-Q through a series of experiments in three domains and shows experimentally that MAXQ-Q (with state abstractions) converges to a recursively optimal policy much faster than flat Q-learning. The fact that MAXQ learns a representation of the value function has the important benefit that it makes it possible to compute and execute an improved, non-hierarchical policy via a procedure similar to the policy improvement step of policy iteration. The paper demonstrates the effectiveness of this non-hierarchical execution experimentally. Finally, the paper concludes with a comparison to related work and a discussion of the design trade-offs in hierarchical reinforcement learning.
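The core of the additive decomposition can be written as Q(parent, s, a) = V(a, s) + C(parent, s, a), applied recursively down to primitive actions whose values are learned directly. The sketch below evaluates this recursion on a toy two-action hierarchy; the tables and task names are illustrative assumptions, not the paper's domains.

```python
def q_value(task, s, a, V_primitive, C):
    """Q(task, s, a) = V(a, s) + C(task, s, a): value of doing subtask a, then completing task."""
    return v_value(a, s, V_primitive, C) + C[(task, s, a)]

def v_value(task, s, V_primitive, C):
    if task.startswith("primitive:"):
        return V_primitive[(task, s)]
    # Non-primitive task: value of the best child subtask under it.
    return max(q_value(task, s, child, V_primitive, C) for child in CHILDREN[task])

CHILDREN = {"root": ["primitive:left", "primitive:right"]}
V_primitive = {("primitive:left", 0): 1.0, ("primitive:right", 0): 0.5}
C = {("root", 0, "primitive:left"): 0.0, ("root", 0, "primitive:right"): 2.0}
print(v_value("root", 0, V_primitive, C))  # 2.5: right action plus its completion value wins
```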
Sampling-based algorithms are viewed as practical solutions for high-dimensional motion planning. Recent progress has taken advantage of random geometric graph theory to show how asymptotic optimality can also be achieved with these methods. Achieving this desirable property for systems with dynamics requires solving a two-point boundary value problem (BVP) in the state space of the underlying dynamical system. It is difficult, however, if not impractical, to generate a BVP solver for a variety of important dynamical models of robots or physically simulated ones. Thus, an open challenge was whether it was even possible to achieve optimality guarantees when planning for systems without access to a BVP solver. This work resolves the above question and describes how to achieve asymptotic optimality for kinodynamic planning using incremental sampling-based planners by introducing a new rigorous framework. Two new methods, STABLE_SPARSE_RRT (SST) and SST*, result from this analysis, which are asymptotically near-optimal and optimal, respectively. The techniques are shown to converge fast to high-quality paths, while they maintain only a sparse set of samples, which makes them computationally efficient. The good performance of the planners is confirmed by experimental results using dynamical systems benchmarks, as well as physically simulated robots.
Machine learning algorithms have successfully entered industry through many real-world applications (e.g., search engines and product recommendations). In these applications, the test-time CPU cost must be budgeted and accounted for. In this paper, we examine two main components of the test-time CPU cost, classifier evaluation cost and feature extraction cost, and show how to balance these costs with the classifier accuracy. Since the computation required for feature extraction dominates the test-time cost of a classifier in these settings, we develop two algorithms to efficiently balance the performance with the test-time cost. Our first contribution describes how to construct and optimize a tree of classifiers, through which test inputs traverse along individual paths. Each path extracts different features and is optimized for a specific sub-partition of the input space. Our second contribution is a natural reduction of the tree of classifiers into a cascade. The cascade is particularly useful for class-imbalanced data sets, as the majority of instances can be early-exited out of the cascade when the algorithm is sufficiently confident in its prediction. Because both approaches only compute features for inputs that benefit from them the most, we find our trained classifiers lead to high accuracies at a small fraction of the computational cost.
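The early-exit behavior of the cascade can be sketched as follows: each stage computes a further (increasingly expensive) block of features and emits a prediction as soon as its confidence clears a threshold, so most easy inputs never pay for the later stages. The stage models, feature blocks, and thresholds below are assumptions for illustration, not the learned cascade from the paper.

```python
from typing import Callable, List, Tuple

Stage = Tuple[Callable[[dict], float], float]   # (scoring function, exit threshold)

def cascade_predict(x: dict, stages: List[Stage]) -> int:
    for score, threshold in stages[:-1]:
        s = score(x)                       # uses only the features this stage needs
        if abs(s) >= threshold:            # confident enough: exit early
            return int(s > 0)
    return int(stages[-1][0](x) > 0)       # the last stage always decides

stages = [(lambda x: x["cheap"], 0.8), (lambda x: x["cheap"] + x["expensive"], 0.0)]
print(cascade_predict({"cheap": 0.9, "expensive": -0.1}, stages))   # exits at stage 1 -> 1
print(cascade_predict({"cheap": 0.2, "expensive": -1.0}, stages))   # falls through -> 0
```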