The goal of ensemble regression is to combine several models in order to improve the prediction accuracy in learning problems with a numerical target variable. The process of ensemble learning can be divided into three phases: the generation phase, the pruning phase, and the integration phase. We discuss different approaches to each of these phases that are able to deal with the regression problem, categorizing them in terms of their relevant characteristics and linking them to contributions from different fields. Furthermore, this work makes it possible to identify interesting areas for future research.
The all-relevant problem of feature selection is the identification of all strongly and weakly relevant attributes. This problem is especially hard to solve for time series classification and regression in industrial applications such as predictive maintenance or production line optimization, for which each label or regression target is associated with several time series and meta-information simultaneously. Here, we propose an efficient, scalable feature extraction algorithm for time series, which filters the available features in an early stage of the machine learning pipeline with respect to their significance for the classification or regression task, while controlling the expected percentage of selected but irrelevant features. The proposed algorithm combines established feature extraction methods with a feature importance filter. It has a low computational complexity, allows one to start on a problem with only limited domain knowledge available, can be trivially parallelized, is highly scalable and is based on well-studied non-parametric hypothesis tests. We benchmark our proposed algorithm on all binary classification problems of the UCR time series classification archive as well as time series from a production line optimization project and simulated stochastic processes with an underlying qualitative change of dynamics. The feature extraction algorithms and the FRESH algorithm itself, all of which are described in this work, have been implemented in an open source Python package called tsfresh. Its source code can be found at https://github.com/blue-yonder/tsfresh.
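The abstract points to the tsfresh package; a minimal sketch of how its extraction-plus-filtering entry point is typically called is shown below, using toy data and assumed column names purely to illustrate the call signature.

```python
import pandas as pd
from tsfresh import extract_relevant_features

# Long-format toy data: one row per measurement, "id" links rows to a label.
timeseries = pd.DataFrame({
    "id":    [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4],
    "time":  [0, 1, 2] * 4,
    "value": [0.1, 0.5, 0.2, 1.3, 1.1, 1.4, 0.2, 0.4, 0.3, 1.2, 1.0, 1.5],
})
y = pd.Series([0, 1, 0, 1], index=[1, 2, 3, 4])   # one target per id

# Extraction plus the hypothesis-test based relevance filter in a single call.
features = extract_relevant_features(timeseries, y,
                                     column_id="id", column_sort="time")
print(features.shape)
```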
Text classification (TC) is the task of automatically organizing a set of documents into a set of predefined categories. In recent years, interest in handling documents in digital form has grown, making text classification a challenging problem. The most important issue in text classification is its large number of features. Most of these features are redundant, noisy and irrelevant, leading to overfitting with most classifiers. Therefore, feature extraction is an important step for improving the overall accuracy and performance of text classifiers. In this paper, we give an overview of using Principal Component Analysis (PCA) as feature extraction with various classifiers. It is observed that the performance of the classifiers improves after reducing the dimensionality of the data with PCA. Experiments are conducted on three UCI datasets: Classic03, CNAE-9 and DBWorld e-mails. We compare the classification performance obtained with PCA against popular and well-known text classifiers. The results show that using PCA can substantially improve the classification performance of most classifiers.
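A hedged scikit-learn sketch of the PCA-as-feature-extraction idea described above (not the paper's exact pipeline); the tiny corpus, labels and component count are placeholders.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier

# Placeholder corpus and labels (the paper uses Classic03, CNAE-9 and DBWorld).
documents = ["cheap flights and hotel deals", "meeting agenda for project review",
             "discount hotel booking offer",  "quarterly project status report"]
labels = [1, 0, 1, 0]

# TF-IDF term features, densified so that PCA can be applied.
X = TfidfVectorizer().fit_transform(documents).toarray()

# Project onto the leading principal components before classification.
X_reduced = PCA(n_components=2).fit_transform(X)

clf = KNeighborsClassifier(n_neighbors=1).fit(X_reduced, labels)
print(clf.predict(X_reduced))
```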
Ensemble methods are considered the state-of-the-art solution for many machine learning challenges. Such methods improve the predictive performance of a single model by training multiple models and combining their predictions. This paper introduces the concept of ensemble learning, reviews traditional, novel and state-of-the-art ensemble methods, and discusses current challenges and trends in the field. Keywords: boosting, classifier combination, ensemble models, machine learning, mixtures of experts, multiple classifier system, random forest. Ensemble learning is an umbrella term for methods that combine multiple inducers to make a decision, typically in supervised machine learning tasks. An inducer, also referred to as a base learner, is an algorithm that takes a set of labeled examples as input and produces a model (e.g., a classifier or regressor) that generalizes these examples. Using the produced model, predictions can be made for new unlabeled examples. The inducers in an ensemble can be of any type of machine learning algorithm (e.g., decision tree, neural network, linear regression model, etc.). The main premise of ensemble learning is that by combining multiple models, the errors of a single inducer will likely be compensated by other inducers, and as a result, the overall prediction performance of the ensemble will be better than that of a single inducer. Ensemble learning is usually regarded as the machine learning interpretation of the wisdom of the crowd. This concept can be illustrated through the story of Sir Francis Galton (1822-1911), an English philosopher and statistician who conceived the basic concepts of standard deviation and correlation. While visiting a livestock fair, Galton conducted a simple weight-guessing contest: participants were asked to guess the weight of an ox. Hundreds of people participated, but no one guessed the exact weight. Much to his surprise, Galton found that the average of all guesses came quite close to the exact weight: 1,198 pounds. In this experiment, Galton revealed the power of combining many predictions in order to obtain an accurate prediction. Ensemble methods manifest this concept in machine learning challenges, where they result in improved predictive performance compared to a single model. In addition, when the computational cost of the participating inducers is low (e.g., decision trees), ensemble models are often very efficient.
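As a concrete illustration of combining inducers, a minimal scikit-learn sketch of majority voting over three base learners is shown below; the dataset and the choice of inducers are placeholders.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, random_state=0)

# Three heterogeneous inducers combined by hard majority vote.
ensemble = VotingClassifier(estimators=[
    ("tree", DecisionTreeClassifier(random_state=0)),
    ("logreg", LogisticRegression(max_iter=1000)),
    ("knn", KNeighborsClassifier()),
], voting="hard")

# The combined vote typically scores at least as well as the weaker members.
print(cross_val_score(ensemble, X, y, cv=5).mean())
```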
In the present study, the k-Nearest Neighbor classification method is studied for economic forecasting. Due to the effects of companies' financial distress on stakeholders, financial distress prediction models have been one of the most attractive areas in financial research. In recent years, after the global financial crisis, the number of bankrupt companies has risen. Since financial distress is the first stage of bankruptcy, using financial ratios to predict financial distress has attracted considerable attention from academics as well as economic and financial institutions. Although studies on predicting companies' financial distress in Iran have increased in recent years, most efforts have exploited traditional statistical methods, and only a few studies have used nonparametric methods. Recent studies demonstrate that this nonparametric method is more capable than other methods.
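A hedged sketch of what a k-NN distress classifier over financial ratios might look like in scikit-learn; the ratio columns, labels and neighborhood size are illustrative assumptions, not the study's actual setup.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# Rows: firms; columns: e.g. current ratio, debt ratio, return on assets (placeholders).
X = np.array([[1.8, 0.45,  0.08],
              [0.6, 0.90, -0.12],
              [2.1, 0.30,  0.11],
              [0.7, 0.85, -0.05]])
y = np.array([0, 1, 0, 1])   # 1 = financially distressed

# Scaling matters because k-NN is distance based and ratios have different scales.
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3))
model.fit(X, y)
print(model.predict([[0.9, 0.80, -0.02]]))
```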
As opposed to traditional supervised learning, multiple-instance learning concerns the problem of classifying a bag of instances, given bags that are labeled by a teacher as being overall positive or negative. Current research mainly concentrates on adapting traditional concept learning to solve this problem. In this paper we investigate the use of lazy learning and the Hausdorff distance to approach the multiple-instance problem. We present two variants of the K-nearest neighbor algorithm, called Bayesian-KNN and Citation-KNN, for solving the multiple-instance problem. Experiments on the drug discovery benchmark data show that both algorithms are competitive with the best ones conceived in the concept learning framework. Future work includes exploring a combination of lazy and eager multiple-instance classifiers.
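A small sketch of the bag-level distances underlying these lazy multiple-instance learners: the classical Hausdorff distance is shown together with a minimal variant, which, under the assumption stated here, is the form favored by Citation-KNN.

```python
import numpy as np
from scipy.spatial.distance import cdist

def hausdorff(bag_a, bag_b):
    """Classical (maximal) Hausdorff distance between two bags of instance vectors."""
    d = cdist(bag_a, bag_b)                       # pairwise instance distances
    return max(d.min(axis=1).max(), d.min(axis=0).max())

def minimal_hausdorff(bag_a, bag_b):
    """Minimal variant: distance between the closest pair of instances."""
    return cdist(bag_a, bag_b).min()

bag1 = np.array([[0.0, 0.0], [1.0, 1.0]])
bag2 = np.array([[0.9, 1.1], [5.0, 5.0]])
print(hausdorff(bag1, bag2), minimal_hausdorff(bag1, bag2))
```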
Algorithm selection and hyperparameter tuning remain two of the most challenging tasks in machine learning. The number of machine learning applications is growing much faster than the number of machine learning experts, so we see an increasing need for efficient automation of the learning process. Here we introduce OBOE, an algorithm for time-constrained model selection and hyperparameter tuning. Exploiting similarities between datasets, OBOE discovers promising algorithm and hyperparameter configurations through collaborative filtering. Our system explores these models under a time constraint and can therefore provide a fast initialization to warm-start more fine-grained optimization methods. A novel aspect of our approach is a new heuristic for active learning in time-constrained matrix completion based on optimal experiment design. Our experiments demonstrate that OBOE delivers state-of-the-art performance faster than competing approaches on a test bed of supervised learning problems.
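A heavily simplified numpy sketch of the collaborative-filtering idea behind OBOE (not the actual algorithm): a partially observed dataset-by-model error matrix is approximated with a low-rank factorization, and the reconstruction is used to rank unevaluated configurations for a new dataset.

```python
import numpy as np

rng = np.random.default_rng(0)
errors = rng.random((6, 5))                 # rows: datasets, cols: model configurations
mask = rng.random((6, 5)) < 0.6             # entries that were actually evaluated

observed = np.where(mask, errors, np.nan)
filled = np.where(np.isnan(observed), np.nanmean(observed), observed)  # crude imputation

# Low-rank approximation of the error matrix via truncated SVD.
U, s, Vt = np.linalg.svd(filled, full_matrices=False)
rank = 2
approx = (U[:, :rank] * s[:rank]) @ Vt[:rank, :]

new_dataset = 0
best_config = int(np.argmin(approx[new_dataset]))
print("predicted best model configuration:", best_config)
```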
Machine learning enables computers to learn from data without being explicitly programmed [1,2]. Machine learning can be divided into supervised and unsupervised learning. In supervised learning, the computer learns a target function that maps inputs to outputs from training input-output pairs [3]. Among the most effective and widely used supervised learning algorithms are K-Nearest Neighbors (KNN), Support Vector Machine (SVM), Large Margin Nearest Neighbor (LMNN) and Extended Nearest Neighbor (ENN). The main contribution of this paper is to apply these learning algorithms to 11 different datasets from the UCI machine learning repository and to observe how the accuracy of each algorithm varies across the datasets. Analyzing the accuracy of the algorithms gives a brief insight into the relationship between machine learning algorithms and data dimensionality. All algorithms were developed in Matlab. From these accuracy observations, a comparison can be drawn between KNN, SVM, LMNN and ENN regarding their performance on each dataset.
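A hedged Python sketch of the comparison loop (the study itself used Matlab, and LMNN/ENN are not part of scikit-learn, so only KNN and SVM are shown); built-in datasets stand in for the 11 UCI sets.

```python
from sklearn.datasets import load_iris, load_wine, load_breast_cancer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

datasets = {"iris": load_iris(), "wine": load_wine(), "breast_cancer": load_breast_cancer()}
models = {"KNN": KNeighborsClassifier(), "SVM": SVC()}

# Cross-validated accuracy of each algorithm on each dataset.
for name, data in datasets.items():
    for label, model in models.items():
        acc = cross_val_score(model, data.data, data.target, cv=5).mean()
        print(f"{name:14s} {label}: {acc:.3f}")
```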
Feature selection has been widely used in data mining and machine learning tasks to build a model with a small number of features that improves the classifier's accuracy. In this paper, a novel hybrid feature selection algorithm based on particle swarm optimization is proposed. The proposed method, called HPSO-LS, uses a local search strategy embedded in the particle swarm optimization to select a less correlated and salient feature subset. The goal of the local search technique is to guide the search process of the particle swarm optimization to select distinct features by considering their correlation information. Moreover, the proposed method utilizes a subset size determination scheme to select a subset of features with reduced size. The performance of the proposed method has been evaluated on 13 benchmark classification problems and compared with five state-of-the-art feature selection methods. Moreover, HPSO-LS has been compared with four well-known filter-based methods, including information gain, term variance, Fisher score and mRMR, and four well-known wrapper-based methods, including genetic algorithm, particle swarm optimization, simulated annealing and ant colony optimization. The results demonstrate that the proposed method improves the classification accuracy compared with the filter-based and wrapper-based feature selection methods. Furthermore, several statistical tests show that the proposed method's superiority over the other methods is statistically significant.
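A minimal binary-PSO feature selection sketch in the spirit of the abstract; it omits HPSO-LS's correlation-guided local search and subset-size scheme, and the dataset, swarm size and coefficients are illustrative assumptions. Fitness is cross-validated k-NN accuracy on the selected columns.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X, y = load_breast_cancer(return_X_y=True)
n_particles, n_features, n_iter = 10, X.shape[1], 10

def fitness(bits):
    """Cross-validated accuracy of k-NN restricted to the selected columns."""
    mask = bits.astype(bool)
    if not mask.any():
        return 0.0
    return cross_val_score(KNeighborsClassifier(n_neighbors=5), X[:, mask], y, cv=3).mean()

pos = (rng.random((n_particles, n_features)) < 0.5).astype(float)   # binary positions
vel = rng.normal(0, 1, (n_particles, n_features))
pbest, pbest_fit = pos.copy(), np.array([fitness(p) for p in pos])
gbest = pbest[pbest_fit.argmax()].copy()

for _ in range(n_iter):
    r1, r2 = rng.random(vel.shape), rng.random(vel.shape)
    vel = 0.7 * vel + 1.5 * r1 * (pbest - pos) + 1.5 * r2 * (gbest - pos)
    pos = (rng.random(vel.shape) < 1.0 / (1.0 + np.exp(-vel))).astype(float)  # sigmoid transfer
    fit = np.array([fitness(p) for p in pos])
    improved = fit > pbest_fit
    pbest[improved], pbest_fit[improved] = pos[improved], fit[improved]
    gbest = pbest[pbest_fit.argmax()].copy()

print("selected features:", int(gbest.sum()), "CV accuracy:", round(float(pbest_fit.max()), 3))
```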
This paper presents the top 10 data mining algorithms identified by the IEEE International Conference on Data Mining (ICDM) in December 2006: C4.5, k-Means, SVM, Apriori, EM, PageRank, AdaBoost, kNN, Naive Bayes, and CART. These top 10 algorithms are among the most influential data mining algorithms in the research community. For each algorithm, we provide a description, discuss its impact, and review current and further research on the algorithm. These 10 algorithms cover classification, clustering, statistical learning, association analysis, and link mining, which are all among the most important topics in data mining research and development.
In this dissertation, the problem of learning from highly imbalanced data is studied. Imbalanced data learning is of great importance and poses great challenges in many real applications. Dealing with a minority class normally needs new concepts, observations and solutions in order to fully understand the underlying complicated models. We try to systematically review and solve this special learning task in this dissertation. We propose a new ensemble learning framework, Diversified Ensemble Classifiers for Imbalanced Data Learning (DECIDL), based on the advantages of existing ensemble imbalanced learning strategies. Our framework combines three learning techniques: a) ensemble learning, b) artificial example generation, and c) diversity construction by reverse data re-labeling. As a meta-learner, DECIDL utilizes general supervised learning algorithms as base learners to build an ensemble committee. We create a standard benchmark data pool, which contains 30 highly skewed sets with diverse characteristics from different domains, in order to facilitate future research on imbalanced data learning. We use this benchmark pool to evaluate and compare our DECIDL framework with several ensemble learning methods, namely under-bagging, over-bagging, SMOTE-bagging, and AdaBoost. Extensive experiments suggest that our DECIDL framework is comparable with other methods. The data sets, experiments and results provide a valuable knowledge base for future research on imbalanced learning. We develop a simple but effective artificial example generation method for data balancing. Two new methods, DBEG-ensemble and DECIDL-DBEG, are then designed to improve the power of imbalanced learning. Experiments show that these two methods are comparable to the state-of-the-art methods, e.g., GSVM-RU and SMOTE-bagging. Furthermore, we investigate learning on imbalanced data from a new angle: active learning. By combining active learning with the DECIDL framework, we show that the newly designed Active-DECIDL method is very effective for imbalanced learning, suggesting that the DECIDL framework is very robust and flexible. Lastly, we apply the proposed learning methods to a real-world bioinformatics problem: protein methylation prediction. Extensive computational results show that the DECIDL method performs very well on this imbalanced data mining task. Importantly, the experimental results have confirmed our new contributions on this particular data learning problem.
Pattern analysis often requires a preprocessing stage to extract or select features that help the classification, prediction or clustering stage discriminate or represent the data in a better way. The reason for this requirement is that raw data are complex and difficult to process without first extracting or selecting appropriate features. This paper reviews the theory and motivation behind different commonly used feature selection and extraction methods and presents some of their applications. Some numerical implementations of these methods are also shown. Finally, the selection and extraction methods are compared.
Machine learning models are becoming increasingly complex in order to better fit complex functions. Although fruitful in many domains, this increased complexity comes at the cost of model interpretability. The once popular k-nearest neighbors (kNN) approach, which finds and uses the most similar data for reasoning, has received much less attention in recent decades due to numerous perceived shortcomings relative to other techniques. We show that many of these historical shortcomings of kNN can be overcome, and our contribution applies not only to machine learning but also to online learning, data synthesis, anomaly detection, model compression and reinforcement learning, without sacrificing interpretability. We introduce a synthesis of kNN and information theory that we hope will provide a clear path towards models that are innately interpretable and auditable. Through this work we hope to position kNN, combined with information theory, as a promising path towards fully auditable machine learning and artificial intelligence.
Decision tree is a simple and effective method, and it can be supplemented with ensemble methods to improve its performance. Random Forest and Rotation Forest are two approaches that are currently perceived as "classic". They can build more accurate and diverse classifiers than Bagging and Boosting by introducing diversity, namely by randomly choosing a subset of features or rotating the feature space. However, the splitting criteria used for constructing each tree in Random Forest and Rotation Forest are the Gini index and information gain ratio respectively, both of which are skew-sensitive. When learning from highly imbalanced datasets, class imbalance impedes their ability to learn the minority class concept. The Hellinger distance decision tree (HDDT), proposed by Chawla, is skew-insensitive. In particular, bagged unpruned HDDTs have proven to be an effective way to deal with highly imbalanced problems. Nevertheless, the bootstrap sampling used in Bagging can lead to ensembles of lower diversity compared to Random Forest and Rotation Forest. In order to combine the skew-insensitivity of HDDT with the diversity of Random Forest and Rotation Forest, we use the Hellinger distance as the splitting criterion for building each tree in Random Forest and Rotation Forest respectively. An experimental framework is applied across a wide range of highly imbalanced datasets to investigate the effectiveness of the Hellinger distance, information gain ratio and Gini index when used as splitting criteria in ensembles of decision trees, including Bagging, Boosting, Random Forest and Rotation Forest. In addition, Balanced Random Forest is also included in the experiment since it is designed to tackle the class imbalance problem. The experimental results, contrasted through nonparametric statistical tests, demonstrate that using the Hellinger distance as the splitting criterion to build the individual decision trees in a forest can improve the performance of Random Forest and Rotation Forest for highly imbalanced classification.
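A short sketch of the Hellinger-distance split criterion for a two-class, two-way split, written from the description above: the criterion compares how the two classes distribute over the branches and is insensitive to the class ratio itself.

```python
import numpy as np

def hellinger_split_value(y_left, y_right, pos=1):
    """Hellinger distance of a candidate binary split for binary labels."""
    y_left, y_right = np.asarray(y_left), np.asarray(y_right)
    n_pos = (y_left == pos).sum() + (y_right == pos).sum()
    n_neg = (y_left != pos).sum() + (y_right != pos).sum()
    value = 0.0
    for branch in (y_left, y_right):
        p = (branch == pos).sum() / n_pos      # fraction of all positives in this branch
        q = (branch != pos).sum() / n_neg      # fraction of all negatives in this branch
        value += (np.sqrt(p) - np.sqrt(q)) ** 2
    return np.sqrt(value)

# A split that isolates the rare positives scores higher than a mixed split.
print(hellinger_split_value([1, 1, 0], [0] * 97))
print(hellinger_split_value([1, 0, 0, 0], [1, 0] + [0] * 94))
```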
In this paper we study random forests through their connection with a new framework of adaptive nearest neighbor methods. We first introduce a concept of potential nearest neighbors (k-PNN's) and show that random forests can be seen as adaptively weighted k-PNN methods. Various aspects of random forests are then studied from this perspective. We investigate the effect of terminal node sizes and splitting schemes on the performance of random forests. It has been commonly believed that random forests work best using the largest trees possible. We derive a lower bound on the rate of the mean squared error of regression random forests with non-adaptive splitting schemes and show that, asymptotically, growing the largest trees in such random forests is not optimal. However, it may take a very large sample size for this asymptotic result to kick in for high dimensional problems. We illustrate with simulations the effect of terminal node sizes on the prediction accuracy of random forests with other splitting schemes. In general, it is advantageous to tune the terminal node size for the best performance of random forests. We further show that random forests with adaptive splitting schemes assign weights to k-PNN's in a desirable way: for the estimation at a given target point, these random forests assign voting weights to the k-PNN's of the target point according to the local importance of different input variables. We propose a new simple splitting scheme that achieves desirable adaptivity in a straightforward fashion. This simple scheme can be combined with existing algorithms. The resulting algorithm is computationally faster and gives comparable results. Other possible aspects of random forests, such as using linear combinations in splitting, are also discussed. Simulations and real datasets are used to illustrate the results.
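The practical takeaway above (tune the terminal node size rather than always growing maximal trees) maps to the min_samples_leaf parameter of scikit-learn's RandomForestRegressor; a minimal grid-search sketch on synthetic regression data is given below.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

X, y = make_regression(n_samples=400, n_features=20, noise=10.0, random_state=0)

# Cross-validated search over terminal node sizes.
search = GridSearchCV(
    RandomForestRegressor(n_estimators=200, random_state=0),
    param_grid={"min_samples_leaf": [1, 5, 10, 25, 50]},
    scoring="neg_mean_squared_error", cv=5)
search.fit(X, y)
print(search.best_params_)
```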
Many real-life problems can be described as imbalanced, where the number of instances belonging to one of the classes is much larger than the numbers in the other classes. Examples are spam detection, credit card fraud detection or medical diagnosis. Ensembles of classifiers have acquired popularity in this kind of problem for their ability to obtain better results than individual classifiers. The techniques most commonly used by the ensembles especially designed to deal with imbalanced problems are, for example, re-weighting, oversampling and undersampling. Other techniques, originally intended to increase ensemble diversity, have not been systematically studied for their effect on imbalanced problems. Among these are Random Oracles, Disturbing Neighbors, Random Feature Weights and Rotation Forest. This paper presents an overview and an experimental study of various ensemble-based methods for imbalanced problems; the methods have been tested in their original form and in conjunction with several diversity-increasing techniques, using 84 imbalanced data sets from two well-known repositories. This paper shows that these diversity-increasing techniques significantly improve the performance of ensemble methods for imbalanced problems and provides some ideas about when it is more convenient to use these diversifying techniques.
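As one concrete example of the resampling ensembles mentioned above, a minimal under-bagging sketch: each tree sees all minority examples plus an equally sized random sample of the majority class, and predictions are majority-voted. The dataset and ensemble size are placeholders.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=0)

minority, majority = np.where(y == 1)[0], np.where(y == 0)[0]
trees = []
for _ in range(25):
    # Balanced training set: all minority instances + a matching majority sample.
    sampled_majority = rng.choice(majority, size=len(minority), replace=False)
    idx = np.concatenate([minority, sampled_majority])
    trees.append(DecisionTreeClassifier(random_state=0).fit(X[idx], y[idx]))

votes = np.mean([t.predict(X) for t in trees], axis=0)
y_pred = (votes >= 0.5).astype(int)
print("recall on minority class:", (y_pred[y == 1] == 1).mean())
```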
In recent years, the number of complex documents and texts has grown exponentially, requiring a deeper understanding of machine learning methods in order to classify text accurately in many applications. Many machine learning approaches have achieved remarkable results in natural language processing. The success of these learning algorithms relies on their ability to capture complex models and non-linear relationships within the data. However, finding suitable structures, architectures and techniques for text classification is a challenge for researchers. In this paper, a brief overview of text classification algorithms is given. This overview covers different text feature extraction methods, dimensionality reduction methods, existing algorithms and techniques, and evaluation methods. Finally, the limitations of each technique and their application to real-world problems are discussed.
One-class classification (OCC) algorithms aim to build classification models when the negative class is either absent, poorly sampled or not well defined. This unique situation constrains the learning of efficient classifiers by defining the class boundary using only knowledge of the positive class. The OCC problem has been considered and applied under many research themes, such as outlier/novelty detection and concept learning. In this paper we present a unified view of the general problem of OCC by presenting a taxonomy of study for OCC problems, based on the availability of training data, the algorithms used and the application domains. We further delve into each of the categories of the proposed taxonomy and present a comprehensive literature review of OCC algorithms, techniques and methodologies, with a focus on their significance, limitations and applications. We conclude the paper by discussing some open research problems in the field of OCC and present our vision for future research.
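A minimal sketch of OCC in practice, using scikit-learn's OneClassSVM as one widely available implementation (the survey covers many other techniques): the model is fit on positive-class data only and flags new points as in-class or outliers.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X_pos = rng.normal(loc=0.0, scale=1.0, size=(200, 2))   # only the positive class is observed

# nu bounds the fraction of training points treated as outliers.
occ = OneClassSVM(kernel="rbf", nu=0.05, gamma="scale").fit(X_pos)

X_new = np.array([[0.1, -0.2], [6.0, 6.0]])
print(occ.predict(X_new))   # +1 = in-class, -1 = outlier
```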
Instance-based learning algorithms are often faced with the problem of deciding which instances to store for use during generalization. Storing too many instances can result in large memory requirements and slow execution speed, and can cause an oversensitivity to noise. This paper has two main purposes. First, it provides a survey of existing algorithms used to reduce storage requirements in instance-based learning algorithms and other exemplar-based algorithms. Second, it proposes six additional reduction algorithms called DROP1-DROP5 and DEL (three of which were first described in Wilson & Martinez, 1997c, as RT1-RT3) that can be used to remove instances from the concept description. These algorithms and 10 algorithms from the survey are compared on 31 classification tasks. Of those algorithms that provide substantial storage reduction, the DROP algorithms have the highest average generalization accuracy in these experiments, especially in the presence of uniform class noise.
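A heavily simplified, unoptimized sketch of a DROP-style reduction rule, written from the idea above rather than the exact DROP1-DROP5 procedures: an instance is dropped when its associates are classified at least as well without it as with it, using leave-one-out k-NN over the retained set.

```python
import numpy as np
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
k = 3

def neighbours(i, pool):
    """Indices of i's k nearest neighbours within pool (excluding i itself)."""
    pool = np.array([j for j in pool if j != i])
    d = np.linalg.norm(X[pool] - X[i], axis=1)
    return pool[np.argsort(d)[:k]]

def knn_label(i, pool):
    """Majority label among i's k nearest neighbours in pool."""
    return np.bincount(y[neighbours(i, pool)]).argmax()

retained = list(range(len(X)))
for p in range(len(X)):
    if p not in retained:
        continue
    # Associates: retained instances that currently count p among their neighbours.
    associates = [a for a in retained if a != p and p in neighbours(a, retained)]
    with_p = sum(knn_label(a, retained) == y[a] for a in associates)
    reduced = [j for j in retained if j != p]
    without_p = sum(knn_label(a, reduced) == y[a] for a in associates)
    if without_p >= with_p:          # removing p does not hurt its associates
        retained = reduced

print(f"kept {len(retained)} of {len(X)} instances")
```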
The k-nearest neighbor method performs the classification task for a query sample based on the information contained in its neighborhood. Previous work on the k-nearest neighbor algorithm usually obtains the decision value for a class by combining the support of each sample in the neighborhood. Such methods typically consider the nearest neighbors individually and may lose overall neighborhood information that is important for classification, for example, distribution information. This paper proposes a new local learning method that organizes neighborhood information through local distributions. In the proposed method, the additional distribution information in the neighborhood is estimated and organized, and the classification decision is based on the maximum posterior probability estimated from the local distribution in the neighborhood. Moreover, based on local distributions, we derive a generalized local classification form that can be applied effectively to various datasets by tuning its parameters. We use synthetic and real datasets to evaluate the classification performance of the proposed method; the experimental results demonstrate its dimensional scalability, efficiency, effectiveness and robustness compared with some other state-of-the-art classifiers. The results indicate that the proposed method is effective and promising in a wide range of domains.
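A simplified sketch of the local-MAP idea (not the paper's exact estimator): within the k-neighborhood of a query, each class's likelihood is estimated with a Gaussian kernel over its neighbors and combined with the local class frequency as a prior; the class with maximum posterior is returned. The dataset, k and bandwidth are placeholders.

```python
import numpy as np
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)

def local_map_predict(query, k=15, bandwidth=0.5):
    d = np.linalg.norm(X - query, axis=1)
    idx = np.argsort(d)[:k]                       # the local neighbourhood
    posteriors = {}
    for c in np.unique(y[idx]):
        members = idx[y[idx] == c]
        prior = len(members) / k                  # local class frequency
        likelihood = np.mean(np.exp(-(d[members] / bandwidth) ** 2))  # kernel estimate
        posteriors[c] = prior * likelihood
    return max(posteriors, key=posteriors.get)    # maximum a posteriori class

print(local_map_predict(np.array([6.0, 3.0, 4.5, 1.5])))
```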