Machine learning and deep learning classification models are data-driven, and the model and the data jointly determine their classification performance. It is biased to evaluate the model's performance only based on the classifier accuracy while ignoring the data separability. Sometimes, the model exhibits excellent accuracy, which might be attributed to its testing on highly separable data. Most of the current studies on data separability measures are defined based on the distance between sample points, but this has been demonstrated to fail in several circumstances. In this paper, we propose a new separability measure--the rate of separability (RS), which is based on the data coding rate. We validate its effectiveness as a supplement to the separability measure by comparing it to four other distance-based measures on synthetic datasets. Then, we demonstrate the positive correlation between the proposed measure and recognition accuracy in a multi-task scenario constructed from a real dataset. Finally, we discuss the methods for evaluating the classification performance of machine learning and deep learning models considering data separability.
translated by 谷歌翻译
由于更高的维度和困难的班级,机器学习应用中的可用数据变得越来越复杂。根据类重叠,可分离或边界形状,以及组形态,存在各种各样的方法来测量标记数据的复杂性。许多技术可以转换数据才能找到更好的功能,但很少专注于具体降低数据复杂性。大多数数据转换方法主要是治疗维度方面,撇开类标签中的可用信息,当类别在某种方式复杂时,可以有用。本文提出了一种基于AutoEncoder的复杂性减少方法,使用类标签来告知损耗函数关于所生成的变量的充分性。这导致了三个不同的新功能学习者,得分手,斯卡尔和切片机。它们基于Fisher的判别比率,Kullback-Leibler发散和最小二乘支持向量机。它们可以作为二进制分类问题应用作为预处理阶段。跨越27个数据集和一系列复杂性和分类指标的彻底实验表明,课堂上通知的AutoEncoders执行优于4个其他流行的无监督功能提取技术,特别是当最终目标使用数据进行分类任务时。
translated by 谷歌翻译
公制学习旨在学习一个距离度量,以便在将不同的实例推开时将语义上相似的实例放在一起。许多现有方法考虑在特征空间中最大化或至少限制距离距离的距离,以分离相似和不同的实例对以保证其概括能力。在本文中,我们主张在输入空间中施加对抗边缘,以改善公制学习算法的概括和稳健性。我们首先表明,对抗边缘定义为训练实例与其最接近的对手示例之间的距离,它既考虑了特征空间中的距离差距以及指标和三重限制之间的相关性。接下来,为了增强实例扰动的鲁棒性,我们建议通过最大程度地减少称为扰动损失的新型损失函数来扩大对抗缘。提出的损失可以看作是数据依赖性的正规器,并轻松地插入任何现有的度量学习方法中。最后,我们表明扩大边缘通过使用算法鲁棒性的理论技术对概括能力有益。 16个数据集的实验结果证明了所提出的方法比现有的最新方法具有歧视精度和鲁棒性,以抵抗可能的噪声。
translated by 谷歌翻译
这是一门专门针对STEM学生开发的介绍性机器学习课程。我们的目标是为有兴趣的读者提供基础知识,以在自己的项目中使用机器学习,并将自己熟悉术语作为进一步阅读相关文献的基础。在这些讲义中,我们讨论受监督,无监督和强化学习。注释从没有神经网络的机器学习方法的说明开始,例如原理分析,T-SNE,聚类以及线性回归和线性分类器。我们继续介绍基本和先进的神经网络结构,例如密集的进料和常规神经网络,经常性的神经网络,受限的玻尔兹曼机器,(变性)自动编码器,生成的对抗性网络。讨论了潜在空间表示的解释性问题,并使用梦和对抗性攻击的例子。最后一部分致力于加强学习,我们在其中介绍了价值功能和政策学习的基本概念。
translated by 谷歌翻译
当前,随机平滑被认为是获得确切可靠分类器的最新方法。尽管其表现出色,但该方法仍与各种严重问题有关,例如``认证准确性瀑布'',认证与准确性权衡甚至公平性问题。已经提出了依赖输入的平滑方法,目的是克服这些缺陷。但是,我们证明了这些方法缺乏正式的保证,因此所产生的证书是没有道理的。我们表明,一般而言,输入依赖性平滑度遭受了维数的诅咒,迫使方差函数具有低半弹性。另一方面,我们提供了一个理论和实用的框架,即使在严格的限制下,即使在有维度的诅咒的情况下,即使在存在维度的诅咒的情况下,也可以使用依赖输入的平滑。我们提供平滑方差功能的一种混凝土设计,并在CIFAR10和MNIST上进行测试。我们的设计减轻了经典平滑的一些问题,并正式下划线,但仍需要进一步改进设计。
translated by 谷歌翻译
大量的数据和创新算法使数据驱动的建模成为现代行业的流行技术。在各种数据驱动方法中,潜在变量模型(LVM)及其对应物占主要份额,并在许多工业建模领域中起着至关重要的作用。 LVM通常可以分为基于统计学习的经典LVM和基于神经网络的深层LVM(DLVM)。我们首先讨论经典LVM的定义,理论和应用,该定义和应用既是综合教程,又是对经典LVM的简短申请调查。然后,我们对当前主流DLVM进行了彻底的介绍,重点是其理论和模型体系结构,此后不久就提供了有关DLVM的工业应用的详细调查。上述两种类型的LVM具有明显的优势和缺点。具体而言,经典的LVM具有简洁的原理和良好的解释性,但是它们的模型能力无法解决复杂的任务。基于神经网络的DLVM具有足够的模型能力,可以在复杂的场景中实现令人满意的性能,但它以模型的解释性和效率为例。旨在结合美德并减轻这两种类型的LVM的缺点,并探索非神经网络的举止以建立深层模型,我们提出了一个新颖的概念,称为“轻量级Deep LVM(LDLVM)”。在提出了这个新想法之后,该文章首先阐述了LDLVM的动机和内涵,然后提供了两个新颖的LDLVM,并详尽地描述了其原理,建筑和优点。最后,讨论了前景和机会,包括重要的开放问题和可能的研究方向。
translated by 谷歌翻译
Recent years witnessed the breakthrough of face recognition with deep convolutional neural networks. Dozens of papers in the field of FR are published every year. Some of them were applied in the industrial community and played an important role in human life such as device unlock, mobile payment, and so on. This paper provides an introduction to face recognition, including its history, pipeline, algorithms based on conventional manually designed features or deep learning, mainstream training, evaluation datasets, and related applications. We have analyzed and compared state-of-the-art works as many as possible, and also carefully designed a set of experiments to find the effect of backbone size and data distribution. This survey is a material of the tutorial named The Practical Face Recognition Technology in the Industrial World in the FG2023.
translated by 谷歌翻译
越来越多的工作已经认识到利用机器学习(ML)进步的重要性,以满足提取访问控制属性,策略挖掘,策略验证,访问决策等有效自动化的需求。在这项工作中,我们调查和总结了各种ML解决不同访问控制问题的方法。我们提出了ML模型在访问控制域中应用的新分类学。我们重点介绍当前的局限性和公开挑战,例如缺乏公共现实世界数据集,基于ML的访问控制系统的管理,了解黑盒ML模型的决策等,并列举未来的研究方向。
translated by 谷歌翻译
最近有一项激烈的活动在嵌入非常高维和非线性数据结构的嵌入中,其中大部分在数据科学和机器学习文献中。我们分四部分调查这项活动。在第一部分中,我们涵盖了非线性方法,例如主曲线,多维缩放,局部线性方法,ISOMAP,基于图形的方法和扩散映射,基于内核的方法和随机投影。第二部分与拓扑嵌入方法有关,特别是将拓扑特性映射到持久图和映射器算法中。具有巨大增长的另一种类型的数据集是非常高维网络数据。第三部分中考虑的任务是如何将此类数据嵌入中等维度的向量空间中,以使数据适合传统技术,例如群集和分类技术。可以说,这是算法机器学习方法与统计建模(所谓的随机块建模)之间的对比度。在论文中,我们讨论了两种方法的利弊。调查的最后一部分涉及嵌入$ \ mathbb {r}^ 2 $,即可视化中。提出了三种方法:基于第一部分,第二和第三部分中的方法,$ t $ -sne,UMAP和大节。在两个模拟数据集上进行了说明和比较。一个由嘈杂的ranunculoid曲线组成的三胞胎,另一个由随机块模型和两种类型的节点产生的复杂性的网络组成。
translated by 谷歌翻译
本文介绍了分类器校准原理和实践的简介和详细概述。校准的分类器正确地量化了与其实例明智的预测相关的不确定性或信心水平。这对于关键应用,最佳决策,成本敏感的分类以及某些类型的上下文变化至关重要。校准研究具有丰富的历史,其中几十年来预测机器学习作为学术领域的诞生。然而,校准兴趣的最近增加导致了新的方法和从二进制到多种子体设置的扩展。需要考虑的选项和问题的空间很大,并导航它需要正确的概念和工具集。我们提供了主要概念和方法的介绍性材料和最新的技术细节,包括适当的评分规则和其他评估指标,可视化方法,全面陈述二进制和多字数分类的HOC校准方法,以及几个先进的话题。
translated by 谷歌翻译
We review clustering as an analysis tool and the underlying concepts from an introductory perspective. What is clustering and how can clusterings be realised programmatically? How can data be represented and prepared for a clustering task? And how can clustering results be validated? Connectivity-based versus prototype-based approaches are reflected in the context of several popular methods: single-linkage, spectral embedding, k-means, and Gaussian mixtures are discussed as well as the density-based protocols (H)DBSCAN, Jarvis-Patrick, CommonNN, and density-peaks.
translated by 谷歌翻译
Current learning machines have successfully solved hard application problems, reaching high accuracy and displaying seemingly "intelligent" behavior. Here we apply recent techniques for explaining decisions of state-of-the-art learning machines and analyze various tasks from computer vision and arcade games. This showcases a spectrum of problem-solving behaviors ranging from naive and short-sighted, to wellinformed and strategic. We observe that standard performance evaluation metrics can be oblivious to distinguishing these diverse problem solving behaviors. Furthermore, we propose our semi-automated Spectral Relevance Analysis that provides a practically effective way of characterizing and validating the behavior of nonlinear learning machines. This helps to assess whether a learned model indeed delivers reliably for the problem that it was conceived for. Furthermore, our work intends to add a voice of caution to the ongoing excitement about machine intelligence and pledges to evaluate and judge some of these recent successes in a more nuanced manner.
translated by 谷歌翻译
Learning curves provide insight into the dependence of a learner's generalization performance on the training set size. This important tool can be used for model selection, to predict the effect of more training data, and to reduce the computational complexity of model training and hyperparameter tuning. This review recounts the origins of the term, provides a formal definition of the learning curve, and briefly covers basics such as its estimation. Our main contribution is a comprehensive overview of the literature regarding the shape of learning curves. We discuss empirical and theoretical evidence that supports well-behaved curves that often have the shape of a power law or an exponential. We consider the learning curves of Gaussian processes, the complex shapes they can display, and the factors influencing them. We draw specific attention to examples of learning curves that are ill-behaved, showing worse learning performance with more training data. To wrap up, we point out various open problems that warrant deeper empirical and theoretical investigation. All in all, our review underscores that learning curves are surprisingly diverse and no universal model can be identified.
translated by 谷歌翻译
现代高维方法经常采用“休稀稀物”的原则,而在监督多元学习统计学中可能面临着大量非零系数的“密集”问题。本文提出了一种新的聚类减少秩(CRL)框架,其施加了两个联合矩阵规范化,以自动分组构建预测因素的特征。 CRL比低级别建模更具可解释,并放松变量选择中的严格稀疏假设。在本文中,提出了新的信息 - 理论限制,揭示了寻求集群的内在成本,以及多元学习中的维度的祝福。此外,开发了一种有效的优化算法,其执行子空间学习和具有保证融合的聚类。所获得的定点估计器虽然不一定是全局最佳的,但在某些规则条件下享有超出标准似然设置的所需的统计准确性。此外,提出了一种新的信息标准,以及其无垢形式,用于集群和秩选择,并且具有严格的理论支持,而不假设无限的样本大小。广泛的模拟和实数据实验证明了所提出的方法的统计准确性和可解释性。
translated by 谷歌翻译
Accurate uncertainty quantification is a major challenge in deep learning, as neural networks can make overconfident errors and assign high confidence predictions to out-of-distribution (OOD) inputs. The most popular approaches to estimate predictive uncertainty in deep learning are methods that combine predictions from multiple neural networks, such as Bayesian neural networks (BNNs) and deep ensembles. However their practicality in real-time, industrial-scale applications are limited due to the high memory and computational cost. Furthermore, ensembles and BNNs do not necessarily fix all the issues with the underlying member networks. In this work, we study principled approaches to improve uncertainty property of a single network, based on a single, deterministic representation. By formalizing the uncertainty quantification as a minimax learning problem, we first identify distance awareness, i.e., the model's ability to quantify the distance of a testing example from the training data, as a necessary condition for a DNN to achieve high-quality (i.e., minimax optimal) uncertainty estimation. We then propose Spectral-normalized Neural Gaussian Process (SNGP), a simple method that improves the distance-awareness ability of modern DNNs with two simple changes: (1) applying spectral normalization to hidden weights to enforce bi-Lipschitz smoothness in representations and (2) replacing the last output layer with a Gaussian process layer. On a suite of vision and language understanding benchmarks, SNGP outperforms other single-model approaches in prediction, calibration and out-of-domain detection. Furthermore, SNGP provides complementary benefits to popular techniques such as deep ensembles and data augmentation, making it a simple and scalable building block for probabilistic deep learning. Code is open-sourced at https://github.com/google/uncertainty-baselines
translated by 谷歌翻译
Model bias triggered by long-tailed data has been widely studied. However, measure based on the number of samples cannot explicate three phenomena simultaneously: (1) Given enough data, the classification performance gain is marginal with additional samples. (2) Classification performance decays precipitously as the number of training samples decreases when there is insufficient data. (3) Model trained on sample-balanced datasets still has different biases for different classes. In this work, we define and quantify the semantic scale of classes, which is used to measure the feature diversity of classes. It is exciting to find experimentally that there is a marginal effect of semantic scale, which perfectly describes the first two phenomena. Further, the quantitative measurement of semantic scale imbalance is proposed, which can accurately reflect model bias on multiple datasets, even on sample-balanced data, revealing a novel perspective for the study of class imbalance. Due to the prevalence of semantic scale imbalance, we propose semantic-scale-balanced learning, including a general loss improvement scheme and a dynamic re-weighting training framework that overcomes the challenge of calculating semantic scales in real-time during iterations. Comprehensive experiments show that dynamic semantic-scale-balanced learning consistently enables the model to perform superiorly on large-scale long-tailed and non-long-tailed natural and medical datasets, which is a good starting point for mitigating the prevalent but unnoticed model bias.
translated by 谷歌翻译
Dataset scaling, also known as normalization, is an essential preprocessing step in a machine learning pipeline. It is aimed at adjusting attributes scales in a way that they all vary within the same range. This transformation is known to improve the performance of classification models, but there are several scaling techniques to choose from, and this choice is not generally done carefully. In this paper, we execute a broad experiment comparing the impact of 5 scaling techniques on the performances of 20 classification algorithms among monolithic and ensemble models, applying them to 82 publicly available datasets with varying imbalance ratios. Results show that the choice of scaling technique matters for classification performance, and the performance difference between the best and the worst scaling technique is relevant and statistically significant in most cases. They also indicate that choosing an inadequate technique can be more detrimental to classification performance than not scaling the data at all. We also show how the performance variation of an ensemble model, considering different scaling techniques, tends to be dictated by that of its base model. Finally, we discuss the relationship between a model's sensitivity to the choice of scaling technique and its performance and provide insights into its applicability on different model deployment scenarios. Full results and source code for the experiments in this paper are available in a GitHub repository.\footnote{https://github.com/amorimlb/scaling\_matters}
translated by 谷歌翻译
With the development of a series of Galaxy sky surveys in recent years, the observations increased rapidly, which makes the research of machine learning methods for galaxy image recognition a hot topic. Available automatic galaxy image recognition researches are plagued by the large differences in similarity between categories, the imbalance of data between different classes, and the discrepancy between the discrete representation of Galaxy classes and the essentially gradual changes from one morphological class to the adjacent class (DDRGC). These limitations have motivated several astronomers and machine learning experts to design projects with improved galaxy image recognition capabilities. Therefore, this paper proposes a novel learning method, ``Hierarchical Imbalanced data learning with Weighted sampling and Label smoothing" (HIWL). The HIWL consists of three key techniques respectively dealing with the above-mentioned three problems: (1) Designed a hierarchical galaxy classification model based on an efficient backbone network; (2) Utilized a weighted sampling scheme to deal with the imbalance problem; (3) Adopted a label smoothing technique to alleviate the DDRGC problem. We applied this method to galaxy photometric images from the Galaxy Zoo-The Galaxy Challenge, exploring the recognition of completely round smooth, in between smooth, cigar-shaped, edge-on and spiral. The overall classification accuracy is 96.32\%, and some superiorities of the HIWL are shown based on recall, precision, and F1-Score in comparing with some related works. In addition, we also explored the visualization of the galaxy image features and model attention to understand the foundations of the proposed scheme.
translated by 谷歌翻译
The success of machine learning algorithms generally depends on data representation, and we hypothesize that this is because different representations can entangle and hide more or less the different explanatory factors of variation behind the data. Although specific domain knowledge can be used to help design representations, learning with generic priors can also be used, and the quest for AI is motivating the design of more powerful representation-learning algorithms implementing such priors. This paper reviews recent work in the area of unsupervised feature learning and deep learning, covering advances in probabilistic models, auto-encoders, manifold learning, and deep networks. This motivates longer-term unanswered questions about the appropriate objectives for learning good representations, for computing representations (i.e., inference), and the geometrical connections between representation learning, density estimation and manifold learning.
translated by 谷歌翻译
聚类算法的全面基准是困难的两个关键因素:(i)〜这种无监督的学习方法的独特数学定义和(ii)〜某些聚类算法采用的生成模型或群集标准之间的依赖性的依赖性内部集群验证。因此,对严格基准测试的最佳做法没有达成共识,以及是否有可能在给定申请的背景之外。在这里,我们认为合成数据集必须继续在群集算法的评估中发挥重要作用,但这需要构建适当地涵盖影响聚类算法性能的各种属性集的基准。通过我们的框架,我们展示了重要的角色进化算法,以支持灵活的这种基准,允许简单的修改和扩展。我们说明了我们框架的两种可能用途:(i)〜基准数据的演变与一组手派生属性和(ii)〜生成梳理给定对算法之间的性能差异的数据集。我们的作品对设计集群基准的设计具有足够挑战广泛算法的集群基准,并进一步了解特定方法的优势和弱点。
translated by 谷歌翻译