Supervised deep learning is most commonly applied to difficult problems defined on large and often extensively curated datasets. Here we demonstrate the ability of deep representation learning to address problems of classification and regression from small and poorly formed tabular datasets by encoding input information as abstracted sequences composed of a fixed number of characters per input field. We find that small models have sufficient capacity for approximation of various functions and achieve record classification benchmark accuracy. Such models are shown to form useful embeddings of various input features in their hidden layers, even if the learned task does not explicitly require knowledge of those features. These models are also amenable to input attribution, allowing for an estimation of the importance of each input element to the model output as well as of which inputs features are effectively embedded in the model. We present a proof-of-concept for the application of small language models to mixed tabular data without explicit feature engineering, cleaning, or preprocessing, relying on the model to perform these tasks as part of the representation learning process.
translated by 谷歌翻译
Deep learning models develop successive representations of their input in sequential layers, the last of which maps the final representation to the output. Here we investigate the informational content of these representations by observing the ability of convolutional image classification models to autoencode the model's input using embeddings existing in various layers. We find that the deeper the layer, the less accurate that layer's representation of the input is before training. Inaccurate representation results from non-uniqueness in which various distinct inputs give approximately the same embedding. Non-unique representation is a consequence of both exact and approximate non-invertibility of transformations present in the forward pass. Learning to classify natural images leads to an increase in representation clarity for early but not late layers, which instead form abstract images. Rather than simply selecting for features present in the input necessary for classification, deep layer representations are found to transform the input so that it matches representations of the training data such that arbitrary inputs are mapped to manifolds learned during training. This work provides support for the theory that the tasks of image recognition and input generation are inseparable even for models trained exclusively to classify.
translated by 谷歌翻译
Very large deep learning models trained using gradient descent are remarkably resistant to memorization given their huge capacity, but are at the same time capable of fitting large datasets of pure noise. Here methods are introduced by which models may be trained to memorize datasets that normally are generalized. We find that memorization is difficult relative to generalization, but that adding noise makes memorization easier. Increasing the dataset size exaggerates the characteristics of that dataset: model access to more training samples makes overfitting easier for random data, but somewhat harder for natural images. The bias of deep learning towards generalization is explored theoretically, and we show that generalization results from a model's parameters being attracted to points of maximal stability with respect to that model's inputs during gradient descent.
translated by 谷歌翻译
这是一门专门针对STEM学生开发的介绍性机器学习课程。我们的目标是为有兴趣的读者提供基础知识,以在自己的项目中使用机器学习,并将自己熟悉术语作为进一步阅读相关文献的基础。在这些讲义中,我们讨论受监督,无监督和强化学习。注释从没有神经网络的机器学习方法的说明开始,例如原理分析,T-SNE,聚类以及线性回归和线性分类器。我们继续介绍基本和先进的神经网络结构,例如密集的进料和常规神经网络,经常性的神经网络,受限的玻尔兹曼机器,(变性)自动编码器,生成的对抗性网络。讨论了潜在空间表示的解释性问题,并使用梦和对抗性攻击的例子。最后一部分致力于加强学习,我们在其中介绍了价值功能和政策学习的基本概念。
translated by 谷歌翻译
本教程展示了工作流程,将文本数据纳入精算分类和回归任务。主要重点是采用基于变压器模型的方法。平均长度为400个单词的车祸描述的数据集,英语和德语可用,以及具有简短财产保险索赔的数据集用来证明这些技术。案例研究应对与多语言环境和长输入序列有关的挑战。他们还展示了解释模型输出,评估和改善模型性能的方法,通过将模型调整到应用程序领域或特定预测任务。最后,该教程提供了在没有或仅有少数标记数据的情况下处理分类任务的实用方法。通过使用最少的预处理和微调的现成自然语言处理(NLP)模型的语言理解技能(NLP)模型实现的结果清楚地证明了用于实际应用的转移学习能力。
translated by 谷歌翻译
The success of machine learning algorithms generally depends on data representation, and we hypothesize that this is because different representations can entangle and hide more or less the different explanatory factors of variation behind the data. Although specific domain knowledge can be used to help design representations, learning with generic priors can also be used, and the quest for AI is motivating the design of more powerful representation-learning algorithms implementing such priors. This paper reviews recent work in the area of unsupervised feature learning and deep learning, covering advances in probabilistic models, auto-encoders, manifold learning, and deep networks. This motivates longer-term unanswered questions about the appropriate objectives for learning good representations, for computing representations (i.e., inference), and the geometrical connections between representation learning, density estimation and manifold learning.
translated by 谷歌翻译
基于签名的技术使数学洞察力洞悉不断发展的数据的复杂流之间的相互作用。这些见解可以自然地转化为理解流数据的数值方法,也许是由于它们的数学精度,已被证明在数据不规则而不是固定的情况下分析流的数据以及数据和数据的尺寸很有用样本量均为中等。了解流的多模式数据是指数的:$ d $ d $的字母中的$ n $字母中的一个单词可以是$ d^n $消息之一。签名消除了通过采样不规则性引起的指数级噪声,但仍然存在指数量的信息。这项调查旨在留在可以直接管理指数缩放的域中。在许多问题中,可伸缩性问题是一个重要的挑战,但需要另一篇调查文章和进一步的想法。这项调查描述了一系列环境集足够小以消除大规模机器学习的可能性,并且可以有效地使用一小部分免费上下文和原则性功能。工具的数学性质可以使他们对非数学家的使用恐吓。本文中介绍的示例旨在弥合此通信差距,并提供从机器学习环境中绘制的可进行的工作示例。笔记本可以在线提供这些示例中的一些。这项调查是基于伊利亚·雪佛兰(Ilya Chevryev)和安德烈·科米利津(Andrey Kormilitzin)的早期论文,它们在这种机械开发的较早时刻大致相似。本文说明了签名提供的理论见解是如何在对应用程序数据的分析中简单地实现的,这种方式在很大程度上对数据类型不可知。
translated by 谷歌翻译
我们考虑使用自动监督学习系统的数据表,不仅包含数字/分类列,而且还包含一个或多个文本字段。在这里,我们组装了18个多模式数据表,每个数据表都包含一些文本字段并源于真正的业务应用程序。我们的公开的基准使研究人员能够通过数字,分类和文本功能全面评估自己的监督学习方法。为了确保在所有18个数据集上执行良好的任何单一建模策略将作为多式化文本/表格自动机的实用基础,我们的基准中的不同数据集在:样本大小,问题类型(分类和回归任务组合),功能数量(数据集之间的文本列的数量范围为1到28),以及预测信号如何在文本与数字/分类特征(以及预测相互作用)之间分解。在此基准测试中,我们评估各种直接的流水线来模拟这些数据,包括标准的两阶段方法,其中NLP用于团体化文本,然后可以应用表格数据的自动机。与人类数据科学团队相比,在我们的基准测试(堆叠与各种树模型的堆栈组合多峰变压器的堆栈)的全自动方法也可以在两个机器预测竞赛中符合原始文本/表格数据和第二次在卡格的Mercari价格建议挑战中的地方(2380支球队)。
translated by 谷歌翻译
Future surveys such as the Legacy Survey of Space and Time (LSST) of the Vera C. Rubin Observatory will observe an order of magnitude more astrophysical transient events than any previous survey before. With this deluge of photometric data, it will be impossible for all such events to be classified by humans alone. Recent efforts have sought to leverage machine learning methods to tackle the challenge of astronomical transient classification, with ever improving success. Transformers are a recently developed deep learning architecture, first proposed for natural language processing, that have shown a great deal of recent success. In this work we develop a new transformer architecture, which uses multi-head self attention at its core, for general multi-variate time-series data. Furthermore, the proposed time-series transformer architecture supports the inclusion of an arbitrary number of additional features, while also offering interpretability. We apply the time-series transformer to the task of photometric classification, minimising the reliance of expert domain knowledge for feature selection, while achieving results comparable to state-of-the-art photometric classification methods. We achieve a logarithmic-loss of 0.507 on imbalanced data in a representative setting using data from the Photometric LSST Astronomical Time-Series Classification Challenge (PLAsTiCC). Moreover, we achieve a micro-averaged receiver operating characteristic area under curve of 0.98 and micro-averaged precision-recall area under curve of 0.87.
translated by 谷歌翻译
我用Hunglish2语料库训练神经电脑翻译任务的模型。这项工作的主要贡献在培训NMT模型期间评估不同的数据增强方法。我提出了5种不同的增强方法,这些方法是结构感知的,这意味着而不是随机选择用于消隐或替换的单词,句子的依赖树用作增强的基础。我首先关于神经网络的详细文献综述,顺序建模,神经机翻译,依赖解析和数据增强。经过详细的探索性数据分析和Hunglish2语料库的预处理之后,我使用所提出的数据增强技术进行实验。匈牙利语的最佳型号达到了33.9的BLEU得分,而英国匈牙利最好的模型达到了28.6的BLEU得分。
translated by 谷歌翻译
异构表格数据是最常用的数据形式,对于众多关键和计算要求的应用程序至关重要。在同质数据集上,深度神经网络反复显示出卓越的性能,因此被广泛采用。但是,它们适应了推理或数据生成任务的表格数据仍然具有挑战性。为了促进该领域的进一步进展,这项工作概述了表格数据的最新深度学习方法。我们将这些方法分为三组:数据转换,专业体系结构和正则化模型。对于每个小组,我们的工作提供了主要方法的全面概述。此外,我们讨论了生成表格数据的深度学习方法,并且还提供了有关解释对表格数据的深层模型的策略的概述。因此,我们的第一个贡献是解决上述领域中的主要研究流和现有方法,同时强调相关的挑战和开放研究问题。我们的第二个贡献是在传统的机器学习方法中提供经验比较,并在五个流行的现实世界中的十种深度学习方法中,具有不同规模和不同的学习目标的经验比较。我们已将作为竞争性基准公开提供的结果表明,基于梯度增强的树合奏的算法仍然大多在监督学习任务上超过了深度学习模型,这表明对表格数据的竞争性深度学习模型的研究进度停滞不前。据我们所知,这是对表格数据深度学习方法的第一个深入概述。因此,这项工作可以成为有价值的起点,以指导对使用表格数据深入学习感兴趣的研究人员和从业人员。
translated by 谷歌翻译
在本文中,我们试图通过引入深度学习模型的句法归纳偏见来建立两所学校之间的联系。我们提出了两个归纳偏见的家族,一个家庭用于选区结构,另一个用于依赖性结构。选区归纳偏见鼓励深度学习模型使用不同的单位(或神经元)分别处理长期和短期信息。这种分离为深度学习模型提供了一种方法,可以从顺序输入中构建潜在的层次表示形式,即更高级别的表示由高级表示形式组成,并且可以分解为一系列低级表示。例如,在不了解地面实际结构的情况下,我们提出的模型学会通过根据其句法结构组成变量和运算符的表示来处理逻辑表达。另一方面,依赖归纳偏置鼓励模型在输入序列中找到实体之间的潜在关系。对于自然语言,潜在关系通常被建模为一个定向依赖图,其中一个单词恰好具有一个父节点和零或几个孩子的节点。将此约束应用于类似变压器的模型之后,我们发现该模型能够诱导接近人类专家注释的有向图,并且在不同任务上也优于标准变压器模型。我们认为,这些实验结果为深度学习模型的未来发展展示了一个有趣的选择。
translated by 谷歌翻译
Pre-publication draft of a book to be published byMorgan & Claypool publishers. Unedited version released with permission. All relevant copyrights held by the author and publisher extend to this pre-publication draft.
translated by 谷歌翻译
Multilayer Neural Networks trained with the backpropagation algorithm constitute the best example of a successful Gradient-Based Learning technique. Given an appropriate network architecture, Gradient-Based Learning algorithms can be used to synthesize a complex decision surface that can classify high-dimensional patterns such as handwritten characters, with minimal preprocessing. This paper reviews various methods applied to handwritten character recognition and compares them on a standard handwritten digit recognition task. Convolutional Neural Networks, that are specifically designed to deal with the variability of 2D shapes, are shown to outperform all other techniques.Real-life document recognition systems are composed of multiple modules including eld extraction, segmentation, recognition, and language modeling. A new learning paradigm, called Graph Transformer Networks (GTN), allows such multi-module systems to be trained globally using Gradient-Based methods so as to minimize an overall performance measure.Two systems for on-line handwriting recognition are described. Experiments demonstrate the advantage of global training, and the exibility of Graph Transformer Networks.A Graph Transformer Network for reading bank check is also described. It uses Convolutional Neural Network character recognizers combined with global training techniques to provides record accuracy on business and personal checks. It is deployed commercially and reads several million checks per day.
translated by 谷歌翻译
可解释的人工智能(XAI)的新兴领域旨在为当今强大但不透明的深度学习模型带来透明度。尽管本地XAI方法以归因图的形式解释了个体预测,从而确定了重要特征的发生位置(但没有提供有关其代表的信息),但全局解释技术可视化模型通常学会的编码的概念。因此,两种方法仅提供部分见解,并留下将模型推理解释的负担。只有少数当代技术旨在将本地和全球XAI背后的原则结合起来,以获取更多信息的解释。但是,这些方法通常仅限于特定的模型体系结构,或对培训制度或数据和标签可用性施加其他要求,这实际上使事后应用程序成为任意预训练的模型。在这项工作中,我们介绍了概念相关性传播方法(CRP)方法,该方法结合了XAI的本地和全球观点,因此允许回答“何处”和“ where”和“什么”问题,而没有其他约束。我们进一步介绍了相关性最大化的原则,以根据模型对模型的有用性找到代表性的示例。因此,我们提高了对激活最大化及其局限性的共同实践的依赖。我们证明了我们方法在各种环境中的能力,展示了概念相关性传播和相关性最大化导致了更加可解释的解释,并通过概念图表,概念组成分析和概念集合和概念子区和概念子区和概念子集和定量研究对模型的表示和推理提供了深刻的见解。它们在细粒度决策中的作用。
translated by 谷歌翻译
基因组工程正在进行前所未有的发展,现在已广泛可用。为确保负责任的生物技术创新并减少滥用工程DNA序列,为识别工程型质粒的起源实验室来说是至关重要的。基因工程归因(GEA),制定序列实验室协会的能力将支持这一过程中的法医专家。在这里,我们提出了一种基于度量学习的方法,该方法将最可能的原产实验室排名,同时为质粒序列和实验室产生嵌入。这些嵌入物可用于执行各种下游任务,例如聚类DNA序列和实验室,以及在机器学习模型中使用它们作为特征。我们的方法采用了循环转移增强方法,能够在前10个预测中正确地将原产于原产的90亿美元的时间排列 - 优于所有最新的最先进的方法。我们还证明我们可以使用只需10次\%$ 10 \%$ of序列进行几次拍摄学习并获得76±10美元的准确性。这意味着,我们仅使用第十个数据表达先前的CNN方法。我们还证明我们能够在特定实验室中提取质粒序列中的关键签名,允许对模型的产出进行可解释的检查。
translated by 谷歌翻译
基于变压器模型架构的最近深入学习研究在各种域和任务中展示了最先进的性能,主要是在计算机视觉和自然语言处理域中。虽然最近的一些研究已经实施了使用电子健康记录数据的临床任务的变压器,但它们的范围,灵活性和全面性有限。在本研究中,我们提出了一种灵活的基于变换器的EHR嵌入管道和预测模型框架,它引入了利用了医疗域唯一的数据属性的现有工作流程的几个新颖修改。我们展示了灵活设计的可行性,在重症监护病房的案例研究中,我们的模型准确地预测了七种临床结果,这些临床结果与多个未来的时间范围有关的入院和患者死亡率。
translated by 谷歌翻译
即使机器学习算法已经在数据科学中发挥了重要作用,但许多当前方法对输入数据提出了不现实的假设。由于不兼容的数据格式,或数据集中的异质,分层或完全缺少的数据片段,因此很难应用此类方法。作为解决方案,我们提出了一个用于样本表示,模型定义和培训的多功能,统一的框架,称为“ Hmill”。我们深入审查框架构建和扩展的机器学习的多个范围范式。从理论上讲,为HMILL的关键组件的设计合理,我们将通用近似定理的扩展显示到框架中实现的模型所实现的所有功能的集合。本文还包含有关我们实施中技术和绩效改进的详细讨论,该讨论将在MIT许可下发布供下载。该框架的主要资产是其灵活性,它可以通过相同的工具对不同的现实世界数据源进行建模。除了单独观察到每个对象的一组属性的标准设置外,我们解释了如何在框架中实现表示整个对象系统的图表中的消息推断。为了支持我们的主张,我们使用框架解决了网络安全域的三个不同问题。第一种用例涉及来自原始网络观察结果的IoT设备识别。在第二个问题中,我们研究了如何使用以有向图表示的操作系统的快照可以对恶意二进制文件进行分类。最后提供的示例是通过网络中实体之间建模域黑名单扩展的任务。在所有三个问题中,基于建议的框架的解决方案可实现与专业方法相当的性能。
translated by 谷歌翻译
In recent years, deep learning has infiltrated every field it has touched, reducing the need for specialist knowledge and automating the process of knowledge discovery from data. This review argues that astronomy is no different, and that we are currently in the midst of a deep learning revolution that is transforming the way we do astronomy. We trace the history of astronomical connectionism from the early days of multilayer perceptrons, through the second wave of convolutional and recurrent neural networks, to the current third wave of self-supervised and unsupervised deep learning. We then predict that we will soon enter a fourth wave of astronomical connectionism, in which finetuned versions of an all-encompassing 'foundation' model will replace expertly crafted deep learning models. We argue that such a model can only be brought about through a symbiotic relationship between astronomy and connectionism, whereby astronomy provides high quality multimodal data to train the foundation model, and in turn the foundation model is used to advance astronomical research.
translated by 谷歌翻译
高维计算(HDC)是用于数据表示和学习的范式,起源于计算神经科学。HDC将数据表示为高维,低精度向量,可用于学习或召回等各种信息处理任务。高维空间的映射是HDC中的一个基本问题,现有方法在输入数据本身是高维时会遇到可伸缩性问题。在这项工作中,我们探索了一个基于哈希的流媒体编码技术。我们正式表明,这些方法在学习应用程序的性能方面具有可比的保证,同时比现有替代方案更有效。我们在一个流行的高维分类问题上对这些结果进行了实验验证,并表明我们的方法很容易扩展到非常大的数据集。
translated by 谷歌翻译