Neural networks trained on large datasets by minimizing a loss have become the state-of-the-art approach for solving data science problems, particularly in computer vision, image processing and natural language processing. In spite of their striking results, our theoretical understanding of how neural networks operate is limited. In particular, what are the interpolation capabilities of trained neural networks? In this paper we discuss a theorem of Domingos stating that "every machine learned by continuous gradient descent is approximately a kernel machine". According to Domingos, this fact leads to the conclusion that all machines trained on data are mere kernel machines. We first extend Domingos' result to the discrete case and to networks with vector-valued output. We then study its relevance and significance on simple examples. We find that in simple cases, the "neural tangent kernel" arising in Domingos' theorem does provide understanding of the networks' predictions. Furthermore, when the task given to the network grows in complexity, the interpolation capability of the network can be effectively explained by Domingos' theorem, and therefore is limited. We illustrate this fact on a classic perception theory problem: recovering a shape from its boundary.
translated by Google Translate
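As a rough illustration of the discrete-step view mentioned in the abstract above, the following sketch checks that one gradient-descent step changes a prediction by approximately a weighted sum of tangent-kernel values at the training points. The toy model, data, learning rate, and all function names are assumptions made for this example; it is a first-order illustration, not the theorem or the construction from the paper.

```python
import math

def model(w, x):
    """Tiny one-hidden-unit network f(x; w) = w[2] * tanh(w[0] * x + w[1])."""
    return w[2] * math.tanh(w[0] * x + w[1])

def grad_model(w, x):
    """Gradient of f(x; w) with respect to the three parameters."""
    t = math.tanh(w[0] * x + w[1])
    s = 1.0 - t * t  # derivative of tanh at the pre-activation
    return [w[2] * s * x, w[2] * s, t]

def tangent_kernel(w, x1, x2):
    """K_w(x1, x2) = <grad_w f(x1; w), grad_w f(x2; w)>."""
    return sum(a * b for a, b in zip(grad_model(w, x1), grad_model(w, x2)))

def gd_step(w, xs, ys, lr):
    """One gradient-descent step on the loss 0.5 * sum_i (f(x_i) - y_i)^2."""
    grad_loss = [0.0, 0.0, 0.0]
    for x, y in zip(xs, ys):
        r = model(w, x) - y
        for j, g in enumerate(grad_model(w, x)):
            grad_loss[j] += r * g
    return [wj - lr * gj for wj, gj in zip(w, grad_loss)]

def kernel_prediction_change(w, xs, ys, lr, x_query):
    """First-order change of f(x_query) expressed through the tangent kernel."""
    return -lr * sum(
        tangent_kernel(w, x_query, x) * (model(w, x) - y) for x, y in zip(xs, ys)
    )

# Hypothetical toy data and initial parameters.
xs, ys = [-1.0, 0.0, 1.0], [-0.5, 0.0, 0.8]
w0 = [0.5, -0.2, 1.0]
lr, x_query = 1e-3, 0.4

w1 = gd_step(w0, xs, ys, lr)
actual = model(w1, x_query) - model(w0, x_query)
predicted = kernel_prediction_change(w0, xs, ys, lr, x_query)
print(f"actual change:   {actual:.8f}")
print(f"kernel estimate: {predicted:.8f}")
```

The two printed numbers agree up to terms of second order in the learning rate; summing such steps along the whole training trajectory is what gives rise to a path-dependent kernel.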
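This is an introductory machine learning course developed specifically for STEM students. Our goal is to provide the interested reader with the basics needed to use machine learning in their own projects and to become familiar with the terminology as a foundation for further reading of the relevant literature. In these lecture notes, we discuss supervised, unsupervised, and reinforcement learning. The notes start with an exposition of machine learning methods without neural networks, such as principal component analysis, t-SNE, clustering, as well as linear regression and linear classifiers. We continue with an introduction to both basic and advanced neural-network structures such as dense feed-forward and convolutional neural networks, recurrent neural networks, restricted Boltzmann machines, (variational) autoencoders, and generative adversarial networks. Questions of interpretability are discussed for latent-space representations, using the examples of dreaming and adversarial attacks. The final section is dedicated to reinforcement learning, where we introduce the basic notions of value functions and policy learning.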
These notes were compiled as lecture notes for a course developed and taught at the University of Southern California. They should be accessible to a typical engineering graduate student with a strong background in Applied Mathematics. The main objective of these notes is to introduce a student who is familiar with concepts in linear algebra and partial differential equations to select topics in deep learning. These lecture notes exploit the strong connections between deep learning algorithms and the more conventional techniques of computational physics to achieve two goals. First, they use concepts from computational physics to develop an understanding of deep learning algorithms. Not surprisingly, many concepts in deep learning can be connected to similar concepts in computational physics, and one can utilize this connection to better understand these algorithms. Second, several novel deep learning algorithms can be used to solve challenging problems in computational physics. Thus, they offer someone who is interested in modeling a physical phenomenon a complementary set of tools.
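The goal of this survey is to present an explanatory review of the approximation properties of deep neural networks. Specifically, we aim to understand how and why deep neural networks outperform other classical linear and nonlinear approximation methods. This survey consists of three chapters. In Chapter 1, we review the key ideas and concepts underlying deep networks and their compositional nonlinear structure. We formalize the neural network problem by formulating it as an optimization problem when solving regression and classification problems. We briefly discuss the stochastic gradient descent algorithm and the back-propagation formulas used in solving the optimization problem, and address a few issues related to the performance of neural networks, including the choice of activation functions, cost functions, overfitting issues, and regularization. In Chapter 2, we shift our focus to the approximation theory of neural networks. We start with an introduction to the concept of density in polynomial approximation and in particular study the Stone-Weierstrass theorem for real-valued continuous functions. Then, within the framework of linear approximation, we review a few classical results on the density and convergence rate of feedforward networks, followed by more recent developments on the complexity of deep networks in approximating Sobolev functions. In Chapter 3, utilizing nonlinear approximation theory, we further elaborate on the power of depth and the approximation advantages of deep networks over other classical nonlinear approximation methods.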
In a series of recent theoretical works, it was shown that strongly overparameterized neural networks trained with gradient-based methods could converge exponentially fast to zero training loss, with their parameters hardly varying. In this work, we show that this "lazy training" phenomenon is not specific to overparameterized neural networks, and is due to a choice of scaling, often implicit, that makes the model behave as its linearization around the initialization, thus yielding a model equivalent to learning with positive-definite kernels. Through a theoretical analysis, we exhibit various situations where this phenomenon arises in non-convex optimization and we provide bounds on the distance between the lazy and linearized optimization paths. Our numerical experiments bring a critical note, as we observe that the performance of commonly used non-linear deep convolutional neural networks in computer vision degrades when trained in the lazy regime. This makes it unlikely that "lazy training" is behind the many successes of neural networks in difficult high dimensional tasks.
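The role of scaling described in this abstract can be sketched on a one-parameter toy problem. The example below assumes the objective F_alpha(w) = (alpha * h(w) - y)^2 / (2 * alpha^2) with a model h centred so that h(w0) = 0; the model, data point, learning rate, and step count are all hypothetical choices for illustration and are not taken from the paper.

```python
import math

def h(w, w0, x):
    """Model centred at initialisation, so that h(w0) = 0."""
    return math.tanh(w * x) - math.tanh(w0 * x)

def dh_dw(w, x):
    """Derivative of h with respect to the scalar parameter w."""
    return x * (1.0 - math.tanh(w * x) ** 2)

def train(alpha, w0=0.3, x=1.0, y=0.5, lr=0.5, steps=2000):
    """Gradient descent on F_alpha(w) = (alpha * h(w) - y)^2 / (2 * alpha^2)."""
    w = w0
    for _ in range(steps):
        residual = alpha * h(w, w0, x) - y
        # dF/dw = residual * h'(w) / alpha
        w -= lr * residual * dh_dw(w, x) / alpha
    return w

w0 = 0.3
moved = {alpha: abs(train(alpha, w0=w0) - w0) for alpha in (1.0, 10.0, 100.0)}
for alpha, distance in moved.items():
    print(f"alpha = {alpha:6.1f}   |w - w0| = {distance:.5f}")
```

In this sketch every run fits the single target, but the distance travelled by the parameter shrinks as alpha grows, so for large alpha the model stays close to its linearization around w0.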
Building a quantum analog of classical deep neural networks represents a fundamental challenge in quantum computing. A key issue is how to address the inherent non-linearity of classical deep learning, a problem in the quantum domain due to the fact that the composition of an arbitrary number of quantum gates, consisting of a series of sequential unitary transformations, is intrinsically linear. This problem has been variously approached in the literature, principally via the introduction of measurements between layers of unitary transformations. In this paper, we introduce the Quantum Path Kernel, a formulation of quantum machine learning capable of replicating those aspects of deep machine learning typically associated with superior generalization performance in the classical domain, specifically, hierarchical feature learning. Our approach generalizes the notion of Quantum Neural Tangent Kernel, which has been used to study the dynamics of classical and quantum machine learning models. The Quantum Path Kernel exploits the parameter trajectory, i.e. the curve delineated by model parameters as they evolve during training, enabling the representation of differential layer-wise convergence behaviors, or the formation of hierarchical parametric dependencies, in terms of their manifestation in the gradient space of the predictor function. We evaluate our approach with respect to variants of the classification of Gaussian XOR mixtures - an artificial but emblematic problem that intrinsically requires multilevel learning in order to achieve optimal class separation.
Neural networks are known to be a class of highly expressive functions able to fit even random input-output mappings with 100% accuracy. In this work we present properties of neural networks that complement this aspect of expressivity. By using tools from Fourier analysis, we highlight a learning bias of deep networks towards low frequency functions, i.e. functions that vary globally without local fluctuations, which manifests itself as a frequency-dependent learning speed. Intuitively, this property is in line with the observation that over-parameterized networks prioritize learning simple patterns that generalize across data samples. We also investigate the role of the shape of the data manifold by presenting empirical and theoretical evidence that, somewhat counter-intuitively, learning higher frequencies gets easier with increasing manifold complexity.
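We study the Solid Isotropic Material with Penalization (SIMP) method with a density field generated by a fully-connected neural network that takes the coordinates as inputs. In the large-width limit, we show that the use of DNNs leads to a filtering effect similar to traditional filtering techniques for SIMP, with a filter described by the Neural Tangent Kernel (NTK). This filter is, however, not invariant under translation, leading to visual artifacts and non-optimal shapes. We propose two embeddings of the input coordinates that lead to spatial invariance of the NTK and of the filter. We empirically confirm our theoretical observations and study how the filter size is affected by the architecture of the network. Our solution can easily be applied to any other coordinate-based generation method.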
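The low-dimensional manifold hypothesis posits that the data found in many applications, such as those involving natural images, lie (approximately) on low-dimensional manifolds embedded in a high-dimensional Euclidean space. In this setting, a typical neural network defines a function that takes a finite number of vectors in the embedding space as input. However, one often needs to consider evaluating the optimized network at points outside the training distribution. This paper considers the case in which the training data is distributed in a linear subspace of $\mathbb{R}^d$. We derive estimates on the variation of the learned function, defined by a neural network, in the direction transversal to the subspace. We study the potential regularization effects associated with the network's depth and with noise in the codimension of the data manifold. We also present additional side effects in training due to the presence of noise.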
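Recent research shows that the dynamics of an infinitely wide neural network (NN) trained by gradient descent can be characterized by the Neural Tangent Kernel (NTK) \citep{Jacot2018neural}. Under the squared loss, the infinite-width NN trained by gradient descent with an infinitesimally small learning rate is equivalent to kernel regression with the NTK \citep{arora2019Exact}. However, the equivalence is currently only known for ridge regression \citep{arora2019Harnessing}, while the equivalence between NNs and other kernel machines (KMs), e.g. the support vector machine (SVM), remains unknown. Therefore, in this work, we propose to establish the equivalence between NN and SVM, specifically between the infinitely wide NN trained by soft margin loss and the standard soft margin SVM with NTK trained by subgradient descent. Our main theoretical results include establishing the equivalence between NNs and a broad family of $\ell_2$ regularized KMs with finite-width bounds, which cannot be handled by prior work, and showing that every finite-width NN trained by such regularized loss functions is approximately a KM. Furthermore, we demonstrate that our theory enables three practical applications, including (i) a \textit{non-vacuous} bound for the NN via the corresponding KM; (ii) a \textit{non-trivial} robustness certificate for the infinite-width NN (while existing robustness verification methods would provide vacuous bounds); (iii) infinite-width NNs that are intrinsically more robust than those from previous kernel regression. Our code for the experiments is available at \url{https://github.com/leslie-ch/equiv-nn-svm}.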
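Understanding the properties of neural networks trained via stochastic gradient descent (SGD) is at the heart of the theory of deep learning. In this work, we take a mean-field view and consider a two-layer ReLU network trained via SGD for a univariate regularized regression problem. Our main result is that SGD is biased towards a simple solution: at convergence, the ReLU network implements a piecewise linear map of the inputs, and the number of "knot" points (i.e., points where the tangent of the ReLU network estimator changes) between two consecutive training inputs is at most three. In particular, as the number of neurons of the network grows, the SGD dynamics is captured by the solution of a gradient flow and, at convergence, the distribution of the weights approaches the unique minimizer of a related free energy, which has a Gibbs form. Our key technical contribution consists in the analysis of the estimator resulting from this minimizer: we show that its second derivative vanishes everywhere, except at some specific locations which represent the "knot" points. We also provide empirical evidence that knots at locations distinct from the data points might occur, as predicted by our theory.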
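A seminal work [Jacot et al., 2018] demonstrated that training a neural network under a specific parameterization is equivalent to performing a particular kernel method as the width goes to infinity. This equivalence opened a promising direction for applying results from the rich literature on kernel methods to neural networks, which have been much harder to tackle. The present survey covers key results on kernel convergence as the width goes to infinity, finite-width corrections, applications, and a discussion of the limitations of the corresponding method.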
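The Neural Tangent Kernel (NTK) is a powerful tool for analyzing the training dynamics of neural networks and their generalization bounds. The study of the NTK has been devoted to typical neural network architectures, but it is incomplete for neural networks with Hadamard products (NNs-Hp), e.g., StyleGAN and polynomial neural networks (PNNs). In this work, we derive the finite-width NTK formulation for a special class of NNs-Hp, i.e., polynomial neural networks. We prove their equivalence to the kernel regression predictor with the associated NTK, which expands the application scope of the NTK. Based on our results, we elucidate the separation of PNNs from standard neural networks with respect to extrapolation and spectral bias. Our two key insights are that, compared to standard neural networks, PNNs are able to fit more complicated functions in the extrapolation regime and admit a slower eigenvalue decay of the respective NTK. Moreover, our theoretical results can be extended to other types of NNs-Hp, which expands the scope of our work. Our empirical results validate the separations in broader classes of NNs-Hp, which provides a good justification for a deeper understanding of neural architectures.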
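This paper addresses the need to process non-Euclidean data by introducing a geometric deep learning (GDL) framework for building universal feedforward-type models compatible with differentiable manifold geometries. We show that our GDL models can approximate any continuous target function uniformly on compact sets of a controlled maximum diameter. We obtain curvature-dependent lower bounds on this maximum diameter and upper bounds on the depth of our approximating GDL models. Conversely, we find that there is always a continuous function between any two non-degenerate compact manifolds that any "locally-defined" GDL model cannot uniformly approximate. Our last main result identifies data-dependent conditions guaranteeing that the GDL model implementing our approximation breaks "the curse of dimensionality". We find that any "real-world" (i.e. finite) dataset always satisfies our condition and, conversely, any dataset satisfies our requirement if the target function is smooth. As applications, we confirm the universal approximation capabilities of the following GDL models: the hyperbolic feedforward networks of Ganea et al. (2018), the architecture implementing the deep Kalman filter of Krishnan et al. (2015), and deep softmax classifiers. We build universal extensions/variants of the SPD-matrix regressor of Meyer et al. (2011) and the Procrustean regressor of Fletcher (2003). In the Euclidean setting, our results imply a quantitative version of the approximation theorem of Kidger and Lyons (2020) and a data-dependent version of the uncursed approximation rates of Yarotsky and Zhevnerchuk (2019).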
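This paper proposes a mesh-free computational framework and machine learning theory for solving elliptic PDEs on unknown manifolds, identified with point clouds, based on diffusion maps (DM) and deep learning. The PDE solver is formulated as a supervised learning task to solve a least-squares regression problem that imposes an algebraic equation approximating a PDE (and boundary conditions if applicable). This algebraic equation involves a graph-Laplacian type matrix obtained via DM asymptotic expansion, which is a consistent estimator of second-order elliptic differential operators. The resulting numerical method is to solve a highly non-convex empirical risk minimization problem subject to a solution from a hypothesis space of neural networks (NNs). In a well-posed elliptic PDE setting, when the hypothesis space consists of neural networks with either infinite width or depth, we show that the global minimizer of the empirical loss function is a consistent solution in the limit of large training data. When the hypothesis space is a two-layer neural network, we show that for a sufficiently large width, gradient descent can identify a global minimizer of the empirical loss function. Supporting numerical examples demonstrate the convergence of the solutions, ranging from simple manifolds with low and high co-dimensions to rough surfaces with and without boundaries. We also show that the proposed NN solver can robustly generalize the PDE solution to new data points with generalization errors that are almost identical to the training errors, superseding a Nystrom-based interpolation method.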
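We prove that networks initialized by, e.g., the widely used method proposed by He et al. (2015) and trained using gradient descent on a least-squares loss are not universally consistent. Specifically, we describe a large class of one-dimensional data-generating distributions for which, with high probability, gradient descent only finds a bad local minimum of the optimization landscape, since it is unable to move the biases far away from their initialization at zero. It turns out that in these cases, the network found essentially performs linear regression even if the target function is non-linear. We further provide numerical evidence that this happens in practical situations for some multi-dimensional distributions, and that stochastic gradient descent exhibits similar behavior. We also provide empirical results on how the choice of initialization and optimizer can influence this behavior.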
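The purpose of this review is to introduce the reader to graph kernels with a view to applying them to classification problems in cheminformatics. Graph kernels are functions that allow us to infer chemical properties of molecules, which can help with tasks such as finding compounds suitable for drug design. The use of kernel methods is but one particular way of quantifying the similarity between graphs. We restrict our discussion to this one approach, although popular alternatives have emerged in recent years, most notably graph neural networks.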
Multilayer Neural Networks trained with the backpropagation algorithm constitute the best example of a successful Gradient-Based Learning technique. Given an appropriate network architecture, Gradient-Based Learning algorithms can be used to synthesize a complex decision surface that can classify high-dimensional patterns such as handwritten characters, with minimal preprocessing. This paper reviews various methods applied to handwritten character recognition and compares them on a standard handwritten digit recognition task. Convolutional Neural Networks, which are specifically designed to deal with the variability of 2D shapes, are shown to outperform all other techniques. Real-life document recognition systems are composed of multiple modules including field extraction, segmentation, recognition, and language modeling. A new learning paradigm, called Graph Transformer Networks (GTN), allows such multi-module systems to be trained globally using Gradient-Based methods so as to minimize an overall performance measure. Two systems for on-line handwriting recognition are described. Experiments demonstrate the advantage of global training, and the flexibility of Graph Transformer Networks. A Graph Transformer Network for reading a bank check is also described. It uses Convolutional Neural Network character recognizers combined with global training techniques to provide record accuracy on business and personal checks. It is deployed commercially and reads several million checks per day.
Neural networks with random weights appear in a variety of machine learning applications, most prominently as the initialization of many deep learning algorithms and as a computationally cheap alternative to fully learned neural networks. In the present article, we enhance the theoretical understanding of random neural networks by addressing the following data separation problem: under what conditions can a random neural network make two classes $\mathcal{X}^-, \mathcal{X}^+ \subset \mathbb{R}^d$ (with positive distance) linearly separable? We show that a sufficiently large two-layer ReLU-network with standard Gaussian weights and uniformly distributed biases can solve this problem with high probability. Crucially, the number of required neurons is explicitly linked to geometric properties of the underlying sets $\mathcal{X}^-, \mathcal{X}^+$ and their mutual arrangement. This instance-specific viewpoint allows us to overcome the usual curse of dimensionality (exponential width of the layers) in non-pathological situations where the data carries low-complexity structure. We quantify the relevant structure of the data in terms of a novel notion of mutual complexity (based on a localized version of Gaussian mean width), which leads to sound and informative separation guarantees. We connect our result with related lines of work on approximation, memorization, and generalization.
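There is mounting empirical evidence of emergent phenomena in the capabilities of deep learning methods as we scale up datasets, model sizes, and training times. While there are some accounts of how these resources modulate statistical capacity, far less is known about their effect on the computational problem of model training. This work conducts such an exploration through the lens of learning $k$-sparse parities of $n$ bits, a canonical family of problems that poses theoretical computational barriers. In this setting, we find that neural networks exhibit surprising phase transitions when scaling up dataset size and running time. In particular, we demonstrate empirically that with standard training, a variety of architectures learn sparse parities with $n^{O(k)}$ examples, with loss (and error) curves abruptly dropping after $n^{O(k)}$ iterations. These positive results nearly match known SQ lower bounds, even without an explicit sparsity-promoting prior. We elucidate the mechanisms of these phenomena with a theoretical analysis: we find that the phase transition in performance is not due to SGD "stumbling in the dark" until it finds the hidden set of features (a natural algorithm which also runs in $n^{O(k)}$ time); instead, we show that SGD gradually amplifies a Fourier gap in the population gradient.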
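To make the $k$-sparse parity task concrete, the sketch below enumerates the hypercube $\{-1,+1\}^n$ exactly and shows that the label is uncorrelated with every single input bit, while its correlation with the parity over the hidden support equals one. The instance (n = 8, a hidden support of size k = 3) and the helper names are hypothetical choices for illustration; this only illustrates why low-order statistics carry no signal, and it does not reproduce the paper's experiments or analysis.

```python
from itertools import product

def parity(x, support):
    """Product of the +/-1 coordinates of x indexed by `support`."""
    result = 1
    for i in support:
        result *= x[i]
    return result

def correlation(n, label_support, test_support):
    """Exact E[chi_label(x) * chi_test(x)] over the uniform hypercube {-1, +1}^n."""
    total = sum(
        parity(x, label_support) * parity(x, test_support)
        for x in product((-1, 1), repeat=n)
    )
    return total / 2 ** n

# Hypothetical instance: k = 3 relevant bits out of n = 8.
n, secret = 8, (1, 4, 6)

single_bit = [correlation(n, secret, (i,)) for i in range(n)]
full_set = correlation(n, secret, secret)
print("correlation with each single bit:", single_bit)
print("correlation with the secret set: ", full_set)
```

A brute-force learner would have to test on the order of $n^k$ candidate supports to find the one with non-zero correlation, which is the "stumbling in the dark" baseline the abstract contrasts with the behavior of SGD.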