The goal of ensemble regression is to combine several models in order to improve the prediction accuracy in learning problems with a numerical target variable. The process of ensemble learning can be divided into three phases: the generation phase, the pruning phase, and the integration phase. We discuss different approaches to each of these phases that are able to deal with the regression problem, categorizing them in terms of their relevant characteristics and linking them to contributions from different fields. Furthermore, this work makes it possible to identify interesting areas for future research.
Text classification (TC) is the task of automatically organizing a set of documents into a set of predefined categories. In recent years there has been increasing interest in handling documents in digital form, which makes text classification a challenging problem. The most important issue in text classification is its large number of features, most of which are redundant, noisy, and irrelevant, leading to overfitting with most classifiers. Feature extraction is therefore an important step in improving the overall accuracy and performance of text classifiers. In this paper, we give an overview of using principal component analysis (PCA) for feature extraction with various classifiers. We observe that classifier performance improves after reducing the dimensionality of the data with PCA. Experiments are conducted on three UCI datasets: Classic03, CNAE-9, and the DBWorld e-mails. We compare the classification performance obtained with PCA against popular and well-known text classifiers. The results show that using PCA can substantially improve the classification performance of most classifiers.
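A minimal scikit-learn sketch of the PCA-based feature extraction pipeline described above; the 20 Newsgroups corpus is used as a stand-in because the Classic03, CNAE-9 and DBWorld sets named in the abstract are not bundled with scikit-learn, and the component count and classifier are illustrative choices rather than the paper's exact setup.

```python
# Hedged sketch: PCA feature extraction before text classification.
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import PCA
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

data = fetch_20newsgroups(subset="train", categories=["sci.space", "rec.autos"])
X = TfidfVectorizer(max_features=2000).fit_transform(data.data).toarray()
y = data.target

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Reduce the high-dimensional TF-IDF space to a handful of principal components.
pca = PCA(n_components=50).fit(X_tr)
clf = GaussianNB().fit(pca.transform(X_tr), y_tr)
print("accuracy with PCA features:",
      accuracy_score(y_te, clf.predict(pca.transform(X_te))))
```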
The all-relevant problem of feature selection is the identification of all strongly and weakly relevant attributes. This problem is especially hard to solve for time series classification and regression in industrial applications such as predictive maintenance or production line optimization, for which each label or regression target is associated with several time series and meta-information simultaneously. Here, we are proposing an efficient, scalable feature extraction algorithm for time series, which filters the available features in an early stage of the machine learning pipeline with respect to their significance for the classification or regression task, while controlling the expected percentage of selected but irrelevant features. The proposed algorithm combines established feature extraction methods with a feature importance filter. It has a low computational complexity, allows one to start on a problem with only limited domain knowledge available, can be trivially parallelized, is highly scalable, and is based on well-studied non-parametric hypothesis tests. We benchmark our proposed algorithm on all binary classification problems of the UCR time series classification archive as well as time series from a production line optimization project and simulated stochastic processes with an underlying qualitative change of dynamics. The feature extraction algorithms and the FRESH algorithm itself, all of which are described in this work, have been implemented in an open source Python package called tsfresh. Its source code can be found at https://github.com/blue-yonder/tsfresh.
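Since the abstract points to the tsfresh package, a minimal usage sketch of the FRESH-style workflow may help; the long-format DataFrame layout (columns id, time, value), the label Series, and the FDR level are assumptions made for illustration.

```python
# Minimal sketch of the FRESH workflow with the tsfresh package referenced above.
# `df` is assumed to be a long-format DataFrame with columns id, time, value,
# and `y` a pandas Series of labels indexed by id (both hypothetical here).
import pandas as pd
from tsfresh import extract_features, select_features
from tsfresh.utilities.dataframe_functions import impute

def fresh_pipeline(df: pd.DataFrame, y: pd.Series, fdr_level: float = 0.05) -> pd.DataFrame:
    # 1) Extract a large set of candidate features per time series id.
    X = extract_features(df, column_id="id", column_sort="time")
    impute(X)  # replace NaN/inf produced by degenerate feature calculators
    # 2) Filter features with per-feature hypothesis tests, controlling the
    #    expected fraction of selected but irrelevant features via the FDR level.
    return select_features(X, y, fdr_level=fdr_level)
```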
In the present study, the k-Nearest Neighbor classification method is studied for economic forecasting. Due to the effects of companies' financial distress on stakeholders, financial distress prediction models have been one of the most attractive areas in financial research. In recent years, after the global financial crisis, the number of bankrupt companies has risen. Since companies' financial distress is the first stage of bankruptcy, using financial ratios for predicting financial distress has attracted considerable attention from academics as well as economic and financial institutions. Although studies on predicting companies' financial distress in Iran have increased in recent years, most efforts have exploited traditional statistical methods, and only a few studies have used nonparametric methods. Recent studies demonstrate that this method is more capable than other methods.
Ensemble methods are considered the state-of-the-art solution for many machine learning challenges. Such methods improve the predictive performance of a single model by training multiple models and combining their predictions. This paper introduces the concept of ensemble learning, reviews traditional, novel and state-of-the-art ensemble methods, and discusses current challenges and trends in the field. This article is categorized under: Algorithmic Development > Model Combining; Technologies > Machine Learning; Technologies > Classification. Keywords: boosting, classifier combination, ensemble models, machine learning, mixtures of experts, multiple classifier system, random forest. 1 | INTRODUCTION: Ensemble learning is an umbrella term for methods that combine multiple inducers to make a decision, typically in supervised machine learning tasks. An inducer, also referred to as a base learner, is an algorithm that takes a set of labeled examples as input and produces a model (e.g., a classifier or regressor) that generalizes these examples. By using the produced model, predictions can be drawn for new, unlabeled examples. The inducers in an ensemble can be of any type of machine learning algorithm (e.g., decision tree, neural network, linear regression model, etc.). The main premise of ensemble learning is that by combining multiple models, the errors of a single inducer will likely be compensated by other inducers, and as a result, the overall prediction performance of the ensemble will be better than that of a single inducer. Ensemble learning is usually regarded as the machine learning interpretation of the wisdom of the crowd. This concept can be illustrated through the story of Sir Francis Galton (1822-1911), an English philosopher and statistician who conceived the basic concepts of standard deviation and correlation. While visiting a livestock fair, Galton conducted a simple weight-guessing contest: participants were asked to guess the weight of an ox. Hundreds of people took part, but no one guessed the exact weight. Much to his surprise, Galton found that the average of all guesses came quite close to the exact weight: 1,198 pounds. In this experiment, Galton revealed the power of combining many predictions in order to obtain an accurate prediction. Ensemble methods manifest this concept in machine learning challenges, where they result in improved predictive performance compared to a single model. In addition, when the computational cost of the participating inducers is low (e.g., decision trees), ensemble models are often very efficient.
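A toy numerical illustration of the wisdom-of-the-crowd premise stated above, using Galton's ox-weight story with simulated guesses; the noise level and number of guessers are invented for the example.

```python
# Averaging many noisy estimates tends to cancel individual errors,
# the "wisdom of the crowd" that ensemble learning exploits.
import numpy as np

rng = np.random.default_rng(0)
true_weight = 1198.0                                    # the ox's weight in Galton's story
guesses = true_weight + rng.normal(0, 100, size=800)    # 800 simulated noisy guesses

print("typical individual error:", np.mean(np.abs(guesses - true_weight)))
print("error of the averaged guess:", abs(guesses.mean() - true_weight))
```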
Machine learning enables computers to learn from data without being explicitly programmed [1, 2]. Machine learning can be divided into supervised and unsupervised learning. In supervised learning, the computer learns a target function that maps inputs to outputs from training input-output pairs [3]. Among the most effective and widely used supervised learning algorithms are K-Nearest Neighbors (KNN), Support Vector Machine (SVM), Large Margin Nearest Neighbor (LMNN), and Extended Nearest Neighbor (ENN). The main contribution of this paper is to implement these elegant learning algorithms on eleven different datasets from the UCI Machine Learning Repository and to observe how the accuracy of each algorithm varies across all datasets. Analyzing the accuracy of the algorithms gives a brief insight into the relationship between machine learning algorithms and data dimensionality. All algorithms are developed in Matlab. From these accuracy observations, a comparison can be made among KNN, SVM, LMNN, and ENN with respect to their performance on each dataset.
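A rough Python re-sketch of the kind of comparison described above (the paper's own implementations are in Matlab); only KNN and SVM are shown because LMNN and ENN have no scikit-learn implementation, and Iris stands in for one of the eleven UCI datasets.

```python
# Hedged sketch: compare classifier accuracy on a single UCI-style dataset.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
for name, clf in [("KNN", KNeighborsClassifier(n_neighbors=5)), ("SVM", SVC(kernel="rbf"))]:
    scores = cross_val_score(clf, X, y, cv=5)          # 5-fold cross-validated accuracy
    print(f"{name}: mean accuracy = {scores.mean():.3f}")
```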
Algorithm selection and hyperparameter tuning remain two of the most challenging tasks in machine learning. The number of machine learning applications is growing much faster than the number of machine learning experts, so there is an increasing need for efficient automation of the learning process. Here we introduce OBOE, an algorithm for time-constrained model selection and hyperparameter tuning. By exploiting similarities between datasets, OBOE finds promising algorithm and hyperparameter configurations through collaborative filtering. Our system explores these models under a time constraint, so it can provide a fast initialization to warm-start more fine-grained optimization methods. A novel aspect of our approach is a new heuristic for active learning in time-constrained matrix completion based on optimal experimental design. Our experiments show that OBOE delivers state-of-the-art performance faster than competing approaches on a test bed of supervised learning problems.
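A conceptual sketch, not the authors' code, of the collaborative-filtering idea behind OBOE: treat the errors of candidate models on past datasets as an approximately low-rank matrix and predict a new dataset's unobserved errors from a few cheap evaluations. The function name, the rank, and the least-squares fitting step are illustrative assumptions.

```python
# Hypothetical illustration of low-rank "error matrix" completion for model selection.
import numpy as np

def predict_errors(E, observed_idx, observed_err, rank=3):
    """E: n_datasets x m_models error matrix from past experiments.
    observed_idx / observed_err: models already run on the new dataset."""
    # Low-rank factorization of the historical error matrix.
    U, s, Vt = np.linalg.svd(E, full_matrices=False)
    V = (np.diag(s[:rank]) @ Vt[:rank]).T            # m_models x rank latent factors
    # Fit the new dataset's latent vector by least squares on the observed entries.
    w, *_ = np.linalg.lstsq(V[observed_idx], observed_err, rcond=None)
    return V @ w                                      # predicted error for every candidate model
```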
As opposed to traditional supervised learning, multiple-instance learning concerns the problem of classifying a bag of instances, given bags that are labeled by a teacher as being overall positive or negative. Current research mainly concentrates on adapting traditional concept learning to solve this problem. In this paper we investigate the use of lazy learning and the Hausdorff distance to approach the multiple-instance problem. We present two variants of the K-nearest neighbor algorithm, called Bayesian-KNN and Citation-KNN, for solving the multiple-instance problem. Experiments on the drug discovery benchmark data show that both algorithms are competitive with the best ones conceived in the concept learning framework. Further work includes exploring a combination of lazy and eager multiple-instance problem classifiers.
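A small sketch of a bag-level distance that such lazy multiple-instance learners can build on; the minimal (closest-pair) Hausdorff variant shown here is one common choice in this setting rather than the paper's exact definition.

```python
# Hedged sketch of a bag-to-bag distance for multiple-instance kNN:
# a kNN over bags can rank training bags with it and vote on the query bag's label.
import numpy as np

def minimal_hausdorff(bag_a: np.ndarray, bag_b: np.ndarray) -> float:
    """bag_a: (n_a, d) instances, bag_b: (n_b, d) instances."""
    # Pairwise Euclidean distances between every instance of A and every instance of B.
    d = np.linalg.norm(bag_a[:, None, :] - bag_b[None, :, :], axis=-1)
    return float(d.min())  # distance between the closest cross-bag pair

bag1 = np.array([[0.0, 0.0], [1.0, 1.0]])
bag2 = np.array([[0.9, 1.1], [5.0, 5.0]])
print(minimal_hausdorff(bag1, bag2))  # ~0.14: the two bags share a very similar instance
```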
Machine learning models are becoming increasingly complex in order to better approximate complex functions. Although fruitful in many domains, this added complexity comes at the cost of model interpretability. The once-popular k-nearest neighbors (kNN) approach, which finds and uses the most similar data for inference, has received far less attention in recent decades because of its numerous problems when compared with other techniques. We show that many of these historical shortcomings of kNN can be overcome, and our contributions apply not only to machine learning but also to online learning, data synthesis, anomaly detection, model compression, and reinforcement learning, without sacrificing interpretability. We introduce a synthesis of kNN and information theory that we hope provides a clear path toward models that are inherently interpretable and auditable. Through this work we hope to position kNN combined with information theory as a promising avenue for fully auditable machine learning and artificial intelligence.
This paper presents the top 10 data mining algorithms identified by the IEEE International Conference on Data Mining (ICDM) in December 2006: C4.5, k-Means, SVM, Apriori, EM, PageRank, AdaBoost, kNN, Naive Bayes, and CART. These top 10 algorithms are among the most influential data mining algorithms in the research community. With each algorithm, we provide a description of the algorithm, discuss the impact of the algorithm, and review current and further research on the algorithm. These 10 algorithms cover classification, clustering, statistical learning, association analysis, and link mining, which are all among the most important topics in data mining research and development.
Feature selection has been widely used in data mining and machine learning tasks to build a model with a small number of features that improves the classifier's accuracy. In this paper, a novel hybrid feature selection algorithm based on particle swarm optimization is proposed. The proposed method, called HPSO-LS, uses a local search strategy embedded in the particle swarm optimization to select a less correlated and salient feature subset. The goal of the local search technique is to guide the search process of the particle swarm optimization to select distinct features by considering their correlation information. Moreover, the proposed method utilizes a subset size determination scheme to select a subset of features with reduced size. The performance of the proposed method has been evaluated on 13 benchmark classification problems and compared with five state-of-the-art feature selection methods. Moreover, HPSO-LS has been compared with four well-known filter-based methods, including information gain, term variance, Fisher score and mRMR, and with well-known wrapper-based methods, including genetic algorithm, particle swarm optimization, simulated annealing and ant colony optimization. The results demonstrate that the proposed method improves classification accuracy compared with the filter-based and wrapper-based feature selection methods. Furthermore, several statistical tests show that the proposed method's superiority over the other methods is statistically significant.
One-class classification (OCC) algorithms aim to build classification models when the negative class is either absent, poorly sampled or not well defined. This unique situation constrains the learning of efficient classifiers, since the class boundary must be defined with knowledge of the positive class alone. The OCC problem has been considered and applied under many research themes, such as outlier/novelty detection and concept learning. In this paper we present a unified view of the general problem of OCC by proposing a taxonomy for the study of OCC problems based on the availability of training data, the algorithms used and the application domains. We further delve into each category of the proposed taxonomy and present a comprehensive literature review of OCC algorithms, techniques and methodologies, with a focus on their significance, limitations and applications. We conclude by discussing some open research problems in the field of OCC and presenting our vision for future research.
Keywords: microarray data; correlation-based feature selection; Taguchi-binary particle swarm optimization; K-nearest neighbor. Abstract: The purpose of gene expression analysis is to discriminate between classes of samples and to predict the relative importance of each gene for sample classification. Microarray data on gene expression profiles have provided valuable results for a variety of problems and contributed to advances in clinical medicine. Microarray data characteristically have a high dimension and a small sample size, which makes it difficult for a general classification method to classify the data correctly. However, not every gene is potentially relevant for distinguishing the sample class. Thus, in order to analyze gene expression profiles correctly, feature (gene) selection is crucial for the classification process, and an effective gene extraction method is necessary for eliminating irrelevant genes and decreasing the classification error rate. In this paper, correlation-based feature selection (CFS) and Taguchi chaotic binary particle swarm optimization (TCBPSO) were combined into a hybrid method. The K-nearest neighbor (K-NN) method with leave-one-out cross-validation (LOOCV) served as a classifier for ten gene expression profiles. Experimental results show that this hybrid method effectively simplifies feature selection by reducing the number of features needed. The proposed method obtained the lowest classification error rate on all ten gene expression data set problems tested, and for six of the gene expression profile data sets a classification error rate of zero could be reached. The introduced method outperformed five other methods from the literature in terms of classification error rate. It could thus constitute a valuable tool for gene expression analysis in future studies.
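A minimal sketch of the classification stage described above, a K-NN classifier scored by leave-one-out cross-validation; the inputs X (samples by selected genes) and y are assumed to come from a preceding gene-selection step.

```python
# Hedged sketch: K-NN with leave-one-out cross-validation as the error-rate criterion.
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import LeaveOneOut, cross_val_score

def loocv_error_rate(X, y, k: int = 1) -> float:
    # Each LOOCV fold scores a single held-out sample; the mean is the accuracy.
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=LeaveOneOut())
    return 1.0 - scores.mean()   # classification error rate used to score a gene subset
```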
In this dissertation, the problem of learning from highly imbalanced data is studied. Imbalanced data learning is of great importance and challenge in many real applications. Dealing with a minority class normally requires new concepts, observations and solutions in order to fully understand the underlying complicated models. We try to systematically review and solve this special learning task in this dissertation. We propose a new ensemble learning framework, Diversified Ensemble Classifiers for Imbalanced Data Learning (DECIDL), based on the advantages of existing ensemble imbalanced learning strategies. Our framework combines three learning techniques: (a) ensemble learning, (b) artificial example generation, and (c) diversity construction by reversely re-labeling data. As a meta-learner, DECIDL utilizes general supervised learning algorithms as base learners to build an ensemble committee. We create a standard benchmark data pool, which contains 30 highly skewed sets with diverse characteristics from different domains, in order to facilitate future research on imbalanced data learning. We use this benchmark pool to evaluate and compare our DECIDL framework with several ensemble learning methods, namely under-bagging, over-bagging, SMOTE-bagging, and AdaBoost. Extensive experiments suggest that our DECIDL framework is comparable with other methods. The data sets, experiments and results provide a valuable knowledge base for future research on imbalanced learning. We develop a simple but effective artificial example generation method for data balancing. Two new methods, DBEG-ensemble and DECIDL-DBEG, are then designed to improve the power of imbalanced learning. Experiments show that these two methods are comparable to state-of-the-art methods, e.g., GSVM-RU and SMOTE-bagging. Furthermore, we investigate learning on imbalanced data from a new angle: active learning. By combining active learning with the DECIDL framework, we show that the newly designed Active-DECIDL method is very effective for imbalanced learning, suggesting that the DECIDL framework is very robust and flexible. Lastly, we apply the proposed learning methods to a real-world bioinformatics problem: protein methylation prediction. Extensive computational results show that the DECIDL method performs very well on this imbalanced data mining task, and the experimental results confirm our new contributions to this particular data learning problem.
The k-nearest neighbor method classifies a query sample based on the information contained in its neighborhood. Previous research on the k-nearest neighbor algorithm typically obtains the decision value of a class by combining the support of each individual sample in the neighborhood. These methods usually consider the nearest neighbors separately and may lose overall neighborhood information that is important for classification, such as distributional information. This paper proposes a new local learning method that organizes neighborhood information through local distributions. In this method, additional distributional information in the neighborhood is estimated and organized; the classification decision is based on the maximum posterior probability, which is estimated from the local distribution in the neighborhood. Moreover, based on the local distribution, we derive a generalized local classification form that can be applied effectively to various datasets by tuning its parameters. We evaluate the classification performance of the proposed method using both synthetic and real-world datasets; the experimental results demonstrate the dimensional scalability, efficiency, effectiveness and robustness of the proposed method compared with several state-of-the-art classifiers. The results indicate that the proposed method is effective and promising in a broad range of domains.
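A simplified sketch, not the authors' exact formulation, of a local-distribution decision rule: summarize each class inside the query's neighborhood with a Gaussian and pick the class with the largest estimated posterior instead of taking a plain majority vote.

```python
# Hypothetical illustration of classifying from local distributions rather than raw votes.
import numpy as np

def local_distribution_predict(X, y, query, k=15, eps=1e-6):
    idx = np.argsort(np.linalg.norm(X - query, axis=1))[:k]   # k nearest neighbors
    Xk, yk = X[idx], y[idx]
    best_class, best_score = None, -np.inf
    for c in np.unique(yk):
        Xc = Xk[yk == c]
        prior = len(Xc) / k                                    # local class frequency
        mu, var = Xc.mean(axis=0), Xc.var(axis=0) + eps        # local Gaussian per class
        loglik = -0.5 * np.sum(np.log(2 * np.pi * var) + (query - mu) ** 2 / var)
        score = np.log(prior) + loglik                         # log posterior, up to a constant
        if score > best_score:
            best_class, best_score = c, score
    return best_class
```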
In this paper we study random forests through their connection with a new framework of adaptive nearest neighbor methods. We first introduce the concept of potential nearest neighbors (k-PNNs) and show that random forests can be seen as adaptively weighted k-PNN methods. Various aspects of random forests are then studied from this perspective. We investigate the effect of terminal node sizes and splitting schemes on the performance of random forests. It has been commonly believed that random forests work best using the largest trees possible. We derive a lower bound on the rate of the mean squared error of regression random forests with non-adaptive splitting schemes and show that, asymptotically, growing the largest trees in such random forests is not optimal. However, it may take a very large sample size for this asymptotic result to kick in for high-dimensional problems. We illustrate with simulations the effect of terminal node sizes on the prediction accuracy of random forests with other splitting schemes. In general, it is advantageous to tune the terminal node size for the best performance of random forests. We further show that random forests with adaptive splitting schemes assign weights to k-PNNs in a desirable way: for the estimation at a given target point, these random forests assign voting weights to the k-PNNs of the target point according to the local importance of different input variables. We propose a new, simple splitting scheme that achieves desirable adaptivity in a straightforward fashion. This simple scheme can be combined with existing algorithms. The resulting algorithm is computationally faster and gives comparable results. Other possible aspects of random forests, such as using linear combinations in splitting, are also discussed. Simulations and real datasets are used to illustrate the results.
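A short scikit-learn sketch of the practical takeaway above, tuning the terminal node size (min_samples_leaf) instead of always growing the largest possible trees; the synthetic dataset and parameter grid are illustrative only.

```python
# Hedged sketch: cross-validated tuning of the terminal node size of a regression forest.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

X, y = make_regression(n_samples=500, n_features=20, noise=10.0, random_state=0)
grid = GridSearchCV(
    RandomForestRegressor(n_estimators=200, random_state=0),
    param_grid={"min_samples_leaf": [1, 5, 10, 20, 50]},   # candidate terminal node sizes
    scoring="neg_mean_squared_error", cv=5,
)
grid.fit(X, y)
print("best terminal node size:", grid.best_params_["min_samples_leaf"])
```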
Instance-based learning algorithms are often faced with the problem of deciding which instances to store for use during generalization. Storing too many instances can result in large memory requirements and slow execution speed, and can cause an oversensitivity to noise. This paper has two main purposes. First, it provides a survey of existing algorithms used to reduce storage requirements in instance-based learning algorithms and other exemplar-based algorithms. Second, it proposes six additional reduction algorithms called DROP1-DROP5 and DEL (three of which were first described in Wilson & Martinez, 1997c, as RT1-RT3) that can be used to remove instances from the concept description. These algorithms and 10 algorithms from the survey are compared on 31 classification tasks. Of those algorithms that provide substantial storage reduction, the DROP algorithms have the highest average generalization accuracy in these experiments, especially in the presence of uniform class noise.
This paper offers a survey of recent work on particle swarm classification (PSC), a promising offshoot of particle swarm optimization (PSO), with the goal of positioning it in the overall classification domain. The richness of the related literature shows that this new classification approach may be an efficient alternative, in addition to existing paradigms. After describing the various PSC approaches found in the literature, the paper identifies and discusses two data-related problems that may affect PSC efficiency: high-dimensional datasets and mixed-attribute data. The solutions that have been proposed in the literature for each of these issues are described, including recent improvements by a novel PSC algorithm developed by the authors. Subsequently, PSC is positioned with respect to other classification approaches on these problems. This is accomplished by using one proprietary and five well-known benchmark datasets to determine the performance of the PSC algorithm and comparing the obtained results with those reported for various other classification approaches. It is concluded that PSC can be efficiently applied to classification problems with large numbers of instances, both in continuous and mixed-attribute problem description spaces. Moreover, the obtained results show that PSC may not only be applied to more demanding problem domains, but can also be a competitive alternative to well-established classification techniques.
Training classifiers with datasets that suffer from imbalanced class distributions is an important problem in data mining. This issue occurs when the number of examples representing the class of interest is much lower than that of the other classes. Its presence in many real-world applications has attracted growing attention from researchers. We briefly review the many issues in machine learning and applications of this problem, introducing the characteristics of the imbalanced dataset scenario in classification, presenting the specific metrics for evaluating performance in class-imbalanced learning, and enumerating the proposed solutions. In particular, we describe preprocessing, cost-sensitive learning and ensemble techniques, carrying out an experimental study to contrast these approaches in an intra- and inter-family comparison. We carry out a thorough discussion of the main issues related to using data intrinsic characteristics in this classification problem. This will help to improve the current models with respect to: the presence of small disjuncts, the lack of density in the training data, the overlapping between classes, the identification of noisy data, the significance of the borderline instances, and the dataset shift between the training and test distributions. Finally, we introduce several approaches and recommendations to address these problems in conjunction with imbalanced data, and we show some experimental examples of the behavior of the learning algorithms on data with such intrinsic characteristics.
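A hedged sketch contrasting two of the solution families named above, preprocessing (naive random oversampling of the minority class) and cost-sensitive learning via class weights, evaluated with a metric suited to imbalance; the synthetic data and classifier are illustrative choices.

```python
# Hedged sketch: two simple responses to class imbalance, scored with balanced accuracy.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import balanced_accuracy_score

X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# (a) preprocessing: replicate minority examples until the classes are balanced
minority = np.where(y_tr == 1)[0]
extra = np.random.default_rng(0).choice(minority, size=(y_tr == 0).sum() - len(minority))
X_bal, y_bal = np.vstack([X_tr, X_tr[extra]]), np.concatenate([y_tr, y_tr[extra]])
pre = LogisticRegression(max_iter=1000).fit(X_bal, y_bal)

# (b) cost-sensitive learning: weight errors on the minority class more heavily
cost = LogisticRegression(max_iter=1000, class_weight="balanced").fit(X_tr, y_tr)

for name, clf in [("oversampling", pre), ("cost-sensitive", cost)]:
    print(name, balanced_accuracy_score(y_te, clf.predict(X_te)))
```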
When selecting the best suited algorithm for an unknown optimization problem, it is useful to possess some a priori knowledge of the problem at hand. In the context of single-objective, continuous optimization problems such knowledge can be retrieved by means of Exploratory Landscape Analysis (ELA), which automatically identifies properties of a landscape, e.g., the so-called funnel structures, based on an initial sample. In this paper, we extract the relevant features (for detecting funnels) out of a large set of landscape features when only given a small initial sample consisting of 50 × D observations, where D is the number of decision space dimensions. This is already in the range of the start population sizes of many evolutionary algorithms. The new Multiple Peaks Model Generator (MPM2) is used for training the classifier, and the approach is then very successfully validated on the Black-Box Optimization Benchmark (BBOB) and a subset of the CEC 2013 niching competition problems.