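Deep neural networks have been successfully applied to a broad range of problems in which overparametrization yields weight matrices that are partially random. A comparison of weight matrix singular vectors to the Porter-Thomas distribution suggests that there is a boundary between randomness and learned information in the singular value spectrum. Inspired by this finding, we introduce an algorithm for noise filtering, which both removes small singular values and reduces the magnitude of large singular values to counteract the effect of level repulsion between the noise and the information part of the spectrum. For networks trained in the presence of label noise, we indeed find that the generalization performance improves significantly due to noise filtering.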
translated by 谷歌翻译
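The filtering idea in the first abstract can be illustrated with a minimal sketch, assuming NumPy is available. This is not the authors' algorithm: the boundary `cutoff` and the factor `shrink` are hypothetical, hand-picked parameters, whereas the abstract locates the boundary through a comparison with the Porter-Thomas distribution and shrinks large singular values specifically to counteract level repulsion.

```python
import numpy as np

def filter_weight_matrix(weights, cutoff, shrink=0.9):
    """Zero singular values below `cutoff` and scale the remaining ones by `shrink`.

    Both parameters are illustrative assumptions, not values from the paper.
    """
    u, s, vt = np.linalg.svd(weights, full_matrices=False)
    s_filtered = np.where(s < cutoff, 0.0, shrink * s)
    # Scaling the columns of u by s_filtered is the same as u @ diag(s_filtered).
    return (u * s_filtered) @ vt

rng = np.random.default_rng(0)
# Toy weight matrix: a rank-1 "information" part plus small Gaussian "noise".
signal = 5.0 * np.outer(rng.standard_normal(20), rng.standard_normal(10))
noisy = signal + 0.05 * rng.standard_normal((20, 10))
filtered = filter_weight_matrix(noisy, cutoff=2.0)
rank_after = np.linalg.matrix_rank(filtered)
```

In this toy setting the noise singular values fall well below the chosen cutoff, so only the single information-carrying direction survives the filter.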
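Spiking neural networks (SNNs) underlie low-power, fault-tolerant information processing in the brain and could constitute a power-efficient alternative to conventional deep neural networks when implemented on suitable neuromorphic hardware accelerators. However, instantiating SNNs that solve complex computational tasks in silico remains a significant challenge. Surrogate gradient (SG) techniques have emerged as a standard solution for training SNNs end-to-end. Still, their success depends on synaptic weight initialization, as is the case for conventional artificial neural networks (ANNs). Yet, unlike for ANNs, it remains elusive what constitutes a good initial state for an SNN. Here, we develop a general initialization strategy for SNNs inspired by the fluctuation-driven regime commonly observed in the brain. Specifically, we derive practical solutions for data-dependent weight initialization that ensure fluctuation-driven firing in the widely used leaky integrate-and-fire (LIF) neurons. We empirically show that SNNs initialized following our strategy exhibit superior learning performance when trained with SGs. These findings generalize across several datasets and SNN architectures, including fully connected, deep convolutional, recurrent, and more biologically plausible SNNs obeying Dale's law. Thus, fluctuation-driven initialization provides a practical, versatile, and easy-to-implement strategy for improving SNN training performance on diverse tasks in neuromorphic engineering and computational neuroscience.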
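For readers unfamiliar with the LIF neuron mentioned above, the following is a minimal sketch of its dynamics using simple Euler integration. It only illustrates the neuron model, not the initialization strategy of the abstract, and every parameter value (`dt`, `tau`, `threshold`, `v_reset`, the input currents) is a hypothetical choice made for illustration.

```python
def simulate_lif(input_current, n_steps, dt=1.0, tau=10.0, threshold=1.0, v_reset=0.0):
    """Euler-integrate a leaky integrate-and-fire neuron; return the spike step indices."""
    v = v_reset
    spikes = []
    for step in range(n_steps):
        # The membrane potential leaks toward 0 and is driven by the input.
        v += dt / tau * (-v + input_current)
        if v >= threshold:
            spikes.append(step)
            v = v_reset  # reset after each spike
    return spikes

# A constant input above threshold produces regular spiking...
spikes_strong = simulate_lif(input_current=2.0, n_steps=200)
# ...while a constant input below threshold never reaches it.
spikes_weak = simulate_lif(input_current=0.5, n_steps=200)
```

With a constant sub-threshold input the neuron stays silent, so in that setting spikes can only arise from fluctuations of the input around its mean.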
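While overfitting and, more generally, double descent are ubiquitous in machine learning, increasing the number of parameters of the most widely used tensor network, the matrix product state (MPS), has generally led to a monotonic improvement of test performance in previous studies. To better understand the generalization properties of architectures parameterized by MPS, we construct artificial data that can be modeled exactly by an MPS and train models with different numbers of parameters. We observe model overfitting for one-dimensional data, but also find that overfitting is less significant for more complex data, while for MNIST image data we do not find any signatures of overfitting. We speculate that the generalization properties of MPS depend on the properties of the data: with one-dimensional data (for which the MPS ansatz is the most suitable), MPS is prone to overfitting, while with more complex data that cannot be fit exactly by an MPS, overfitting may be much less significant.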
We introduce the "exponential linear unit" (ELU) which speeds up learning in deep neural networks and leads to higher classification accuracies. Like rectified linear units (ReLUs), leaky ReLUs (LReLUs) and parametrized ReLUs (PReLUs), ELUs alleviate the vanishing gradient problem via the identity for positive values. However, ELUs have improved learning characteristics compared to the units with other activation functions. In contrast to ReLUs, ELUs have negative values, which allow them to push mean unit activations closer to zero like batch normalization but with lower computational complexity. Mean shifts toward zero speed up learning by bringing the normal gradient closer to the unit natural gradient because of a reduced bias shift effect. While LReLUs and PReLUs have negative values, too, they do not ensure a noise-robust deactivation state. ELUs saturate to a negative value with smaller inputs and thereby decrease the forward propagated variation and information. Therefore, ELUs code the degree of presence of particular phenomena in the input, while they do not quantitatively model the degree of their absence. In experiments, ELUs lead not only to faster learning, but also to significantly better generalization performance than ReLUs and LReLUs on networks with more than 5 layers. On CIFAR-100, ELU networks significantly outperform ReLU networks with batch normalization, while batch normalization does not improve ELU networks. ELU networks are among the top 10 reported CIFAR-10 results and yield the best published result on CIFAR-100, without resorting to multi-view evaluation or model averaging. On ImageNet, ELU networks considerably speed up learning compared to a ReLU network with the same architecture, obtaining less than 10% classification error for a single crop, single model network.
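As a small illustration of the activation described above, here is a sketch of the ELU in its usual form: the identity for positive inputs and `alpha * (exp(x) - 1)` otherwise. The default `alpha=1.0` and the sample inputs are assumptions chosen for the example.

```python
import math

def elu(x, alpha=1.0):
    """ELU: identity for positive inputs, saturating toward -alpha for very negative ones."""
    return x if x > 0 else alpha * (math.exp(x) - 1.0)

def relu(x):
    """ReLU for comparison: negative inputs are clipped to zero."""
    return max(0.0, x)

inputs = [-3.0, -1.0, 0.0, 1.0, 3.0]
mean_elu = sum(elu(x) for x in inputs) / len(inputs)
mean_relu = sum(relu(x) for x in inputs) / len(inputs)
```

On these symmetric inputs the mean ELU activation is closer to zero than the mean ReLU activation, which is the mean-shift effect the abstract refers to.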
Deep neural networks may easily memorize noisy labels present in real-world data, which degrades their ability to generalize. It is therefore important to track and evaluate the robustness of models against noisy label memorization. We propose a metric, called susceptibility, to gauge such memorization for neural networks. Susceptibility is simple and easy to compute during training. Moreover, it does not require access to ground-truth labels and it only uses unlabeled data. We empirically show the effectiveness of our metric in tracking memorization on various architectures and datasets and provide theoretical insights into the design of the susceptibility metric. Finally, we show through extensive experiments on datasets with synthetic and real-world label noise that one can utilize susceptibility and the overall training accuracy to distinguish models that maintain a low memorization on the training set and generalize well to unseen clean data.
Deep neural nets with a large number of parameters are very powerful machine learning systems. However, overfitting is a serious problem in such networks. Large networks are also slow to use, making it difficult to deal with overfitting by combining the predictions of many different large neural nets at test time. Dropout is a technique for addressing this problem. The key idea is to randomly drop units (along with their connections) from the neural network during training. This prevents units from co-adapting too much. During training, dropout samples from an exponential number of different "thinned" networks. At test time, it is easy to approximate the effect of averaging the predictions of all these thinned networks by simply using a single unthinned network that has smaller weights. This significantly reduces overfitting and gives major improvements over other regularization methods. We show that dropout improves the performance of neural networks on supervised learning tasks in vision, speech recognition, document classification and computational biology, obtaining state-of-the-art results on many benchmark data sets.
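The train-time and test-time behaviour described above can be sketched as follows. This is a toy illustration on a plain list of activations, not a full network layer; `keep_prob`, the activation values and the number of samples are hypothetical choices.

```python
import random

def dropout_train(activations, keep_prob, rng):
    """Training pass: keep each unit with probability keep_prob, otherwise zero it."""
    return [a if rng.random() < keep_prob else 0.0 for a in activations]

def dropout_test(activations, keep_prob):
    """Test pass: keep every unit but scale it by keep_prob, which has the same
    effect as shrinking the unit's outgoing weights."""
    return [keep_prob * a for a in activations]

rng = random.Random(0)
acts = [1.0, 2.0, 3.0, 4.0]
keep_prob = 0.5
n_samples = 20000

# Average the outputs of many randomly "thinned" passes.
avg = [0.0] * len(acts)
for _ in range(n_samples):
    thinned = dropout_train(acts, keep_prob, rng)
    avg = [s + t / n_samples for s, t in zip(avg, thinned)]

expected = dropout_test(acts, keep_prob)
```

For this single linear stage, the average over sampled masks approaches the scaled test-time output, which is why one unthinned pass can stand in for the ensemble of thinned networks.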
Training Deep Neural Networks is complicated by the fact that the distribution of each layer's inputs changes during training, as the parameters of the previous layers change. This slows down the training by requiring lower learning rates and careful parameter initialization, and makes it notoriously hard to train models with saturating nonlinearities. We refer to this phenomenon as internal covariate shift, and address the problem by normalizing layer inputs. Our method draws its strength from making normalization a part of the model architecture and performing the normalization for each training mini-batch. Batch Normalization allows us to use much higher learning rates and be less careful about initialization. It also acts as a regularizer, in some cases eliminating the need for Dropout. Applied to a state-of-the-art image classification model, Batch Normalization achieves the same accuracy with 14 times fewer training steps, and beats the original model by a significant margin. Using an ensemble of batch-normalized networks, we improve upon the best published result on ImageNet classification: reaching 4.9% top-5 validation error (and 4.8% test error), exceeding the accuracy of human raters.
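The per-mini-batch normalization described above can be sketched for a single feature as follows. This covers only the training-time transform on one mini-batch; the learnable scale `gamma`, shift `beta`, the stabilizer `eps` and the sample batch are illustrative assumptions.

```python
import math

def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one feature across a mini-batch, then scale and shift it."""
    n = len(batch)
    mean = sum(batch) / n
    var = sum((x - mean) ** 2 for x in batch) / n  # biased (population) variance
    return [gamma * (x - mean) / math.sqrt(var + eps) + beta for x in batch]

batch = [2.0, 4.0, 6.0, 8.0]
normalized = batch_norm(batch)
out_mean = sum(normalized) / len(normalized)
out_var = sum((y - out_mean) ** 2 for y in normalized) / len(normalized)
```

After the transform the feature has approximately zero mean and unit variance within the batch, regardless of how the previous layers shifted its distribution.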
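We study the implicit bias of ReLU neural networks trained by a variant of SGD in which, at each step, the label is changed with probability $p$ to a random label (label smoothing is a close variant of this procedure). Our experiments demonstrate that label noise drives the network toward a sparse solution in the following sense: for a typical input, only a small fraction of neurons are active, and the firing pattern of the hidden layers is sparse. In fact, in some cases an appropriate amount of label noise not only sparsifies the network but also reduces the test error. We then turn to a theoretical analysis of these sparsification mechanisms, focusing on the extremal case $p = 1$. We show that in this case the network withers as anticipated from the experiments but, surprisingly, in different ways that depend on the learning rate and the presence of bias, with either weights vanishing or neurons ceasing to fire.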
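The label-corruption step of this SGD variant can be sketched as below. The abstract does not say how the random label is drawn, so drawing it uniformly over all classes (including the true one) is an assumption, as are the class count and the value of `p`.

```python
import random

def noisy_label(label, num_classes, p, rng):
    """With probability p return a uniformly random class, otherwise the true label."""
    if rng.random() < p:
        return rng.randrange(num_classes)
    return label

rng = random.Random(0)
true_label, num_classes, p = 3, 10, 0.2

# Draw the label as it would be seen across many SGD steps.
draws = [noisy_label(true_label, num_classes, p, rng) for _ in range(10000)]
changed_fraction = sum(d != true_label for d in draws) / len(draws)
```

Under this uniform assumption the label actually differs from the true one in roughly `p * (1 - 1 / num_classes)` of the steps, since a random draw can land on the true class.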
We develop new theoretical results on matrix perturbation to shed light on the impact of architecture on the performance of a deep network. In particular, we explain analytically what deep learning practitioners have long observed empirically: the parameters of some deep architectures (e.g., residual networks, ResNets, and Dense networks, DenseNets) are easier to optimize than others (e.g., convolutional networks, ConvNets). Building on our earlier work connecting deep networks with continuous piecewise-affine splines, we develop an exact local linear representation of a deep network layer for a family of modern deep networks that includes ConvNets at one end of a spectrum and ResNets, DenseNets, and other networks with skip connections at the other. For regression and classification tasks that optimize the squared-error loss, we show that the optimization loss surface of a modern deep network is piecewise quadratic in the parameters, with local shape governed by the singular values of a matrix that is a function of the local linear representation. We develop new perturbation results for how the singular values of matrices of this sort behave as we add a fraction of the identity and multiply by certain diagonal matrices. A direct application of our perturbation results explains analytically why a network with skip connections (such as a ResNet or DenseNet) is easier to optimize than a ConvNet: thanks to its more stable singular values and smaller condition number, the local loss surface of such a network is less erratic, less eccentric, and features local minima that are more accommodating to gradient-based optimization. Our results also shed new light on the impact of different nonlinear activation functions on a deep network's singular values, regardless of its architecture.
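Despite their immense success, training successful deep neural networks still relies on experimentally choosing an architecture, hyperparameters, initialization, and training mechanism. In this work, we focus on determining the success of standard gradient descent methods for training deep neural networks on a specified dataset, architecture, and initialization (DAI) combination. Through extensive systematic experiments, we show that the evolution of the singular values of the matrix obtained from the hidden layers of a DNN can help determine the success of gradient descent in training a DNN, even in the absence of validation labels in the supervised learning paradigm. This phenomenon can facilitate early give-up, that is, stopping the training of neural networks that are predicted not to generalize well early in the training process. Our experiments across multiple datasets, architectures, and initializations show that the proposed scores predict the success of a DAI more accurately than simply relying on the validation accuracy at early epochs.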
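Self-supervised contrastive learning (CL) has recently been shown to be very effective in preventing deep networks from overfitting noisy labels. Despite its empirical success, the theoretical understanding of how contrastive learning boosts robustness is very limited. In this work, we rigorously prove that the representation matrix learned by contrastive learning boosts robustness by having: (i) one prominent singular value corresponding to each sub-class in the data, and significantly smaller remaining singular values; and (ii) a large alignment between the prominent singular vectors and the clean labels of each sub-class. These properties enable a linear layer trained on such representations to learn the clean labels effectively without overfitting the noise. We further show that the low-rank structure of the Jacobian of deep networks pre-trained with contrastive learning allows them to achieve superior initial performance when fine-tuned on noisy labels. Finally, we demonstrate that the initial robustness provided by contrastive learning enables robust training methods to achieve state-of-the-art performance under extreme noise levels, e.g., an average accuracy increase of 27.18% and 15.58% on CIFAR-10 and CIFAR-100 with 80% symmetric noisy labels, and a 4.11% increase in accuracy on WebVision.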
Label noise is a significant obstacle in deep learning model training. It can have a considerable impact on the performance of image classification models, particularly deep neural networks, which are especially susceptible because they have a strong propensity to memorise noisy labels. In this paper, we examine the fundamental concept underlying related label noise approaches. We build a transition matrix estimator and demonstrate its effectiveness against the actual transition matrix. In addition, we examine the label noise robustness of two convolutional neural network classifiers with LeNet and AlexNet designs. Experiments on the two FashionMNIST datasets show the robustness of both models. Because time and computing-resource constraints prevented us from tuning the complex convolutional neural network models properly, we could not clearly demonstrate the influence of the transition matrix noise correction on robustness. Future research should fine-tune the neural network models further and explore the precision of the estimated transition model.
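Deep neural networks (DNNs) are powerful tools for compressing and distilling information. Due to their scale and complexity, often involving billions of inter-dependent internal degrees of freedom, exact analysis approaches often fall short. A common strategy in such cases is to identify slow degrees of freedom that average out the erratic behavior of the underlying fast microscopic variables. Here, we identify such a separation of scales occurring in over-parameterized deep convolutional neural networks (CNNs) at the end of training. It implies that neuron pre-activations fluctuate in a nearly Gaussian manner with a deterministic latent kernel. While for CNNs with infinitely many channels these kernels are inert, for finite CNNs they adapt and learn from data in an analytically tractable manner. The resulting thermodynamic theory of deep learning yields accurate predictions on several deep non-linear CNN toy models. In addition, it provides new ways of analyzing and understanding CNNs.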
Understanding the functional principles of information processing in deep neural networks continues to be a challenge, in particular for networks with trained and thus non-random weights. To address this issue, we study the mapping between probability distributions implemented by a deep feed-forward network. We characterize this mapping as an iterated transformation of distributions, where the non-linearity in each layer transfers information between different orders of correlation functions. This allows us to identify essential statistics in the data, as well as different information representations that can be used by neural networks. Applied to an XOR task and to MNIST, we show that correlations up to second order predominantly capture the information processing in the internal layers, while the input layer also extracts higher-order correlations from the data. This analysis provides a quantitative and explainable perspective on classification.
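Exploiting data invariances is crucial for efficient learning in both artificial and biological neural circuits. Understanding how neural networks can discover appropriate representations capable of harnessing the underlying symmetries of their inputs is thus crucial in machine learning and neuroscience. Convolutional neural networks, for example, were designed to exploit translation symmetry, and their capabilities triggered the first wave of deep learning successes. However, learning convolutions directly from translation-invariant data with a fully connected network has so far proven elusive. Here, we show how initially fully connected neural networks solving a discrimination task can learn a convolutional structure directly from their inputs, resulting in localised, space-tiling receptive fields. These receptive fields match the filters of a convolutional network trained on the same task. By carefully designing data models for the visual scene, we show that the emergence of this pattern is triggered by the non-Gaussian, higher-order local structure of the inputs, which has long been recognised as the hallmark of natural images. We provide an analytical and numerical characterisation of the pattern-formation mechanism responsible for this phenomenon in a simple model and find an unexpected link between receptive field formation and the tensor decomposition of higher-order input correlations. These results provide a new perspective on the development of low-level feature detectors in various sensory modalities, and pave the way for studying the impact of higher-order statistics on learning in neural networks.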
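We study the effect of mini-batching on the loss landscape of deep neural networks using spiked, field-dependent random matrix theory. We show that the magnitudes of the extremal values of the batch Hessian are larger than those of the empirical Hessian. We also derive similar results for the Generalised Gauss-Newton matrix approximation of the Hessian. As a consequence of our theorems, we derive analytical expressions for the maximal learning rate as a function of batch size, informing practical training regimens for both stochastic gradient descent (linear scaling) and adaptive algorithms such as Adam (square root scaling), for smooth, non-convex deep neural networks. Whilst the linear scaling for stochastic gradient descent has been derived before under more restrictive conditions, which we generalise, the square root scaling rule for adaptive optimisers is, to our knowledge, completely novel. For stochastic second-order methods and adaptive methods, we derive that the minimal damping coefficient is proportional to the ratio of the learning rate to the batch size. We validate our claims on the VGG/WideResNet architectures on the CIFAR-100 and ImageNet datasets. Based on our investigations of the sub-sampled Hessian, we develop an on-the-fly learning rate and momentum learner based on stochastic Lanczos quadrature, which avoids the need for expensive multiple evaluations of these key hyperparameters and shows good preliminary results on the pre-residual architecture for CIFAR-100.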
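In many contexts, simpler models are preferable to more complex ones, and the control of model complexity is the goal of many methods in machine learning, such as regularization, hyperparameter tuning, and architecture design. In deep learning, it has been difficult to understand the underlying mechanisms of complexity control, since many traditional measures are not naturally suited to deep neural networks. Here we develop the notion of geometric complexity, a measure of the variability of the model function that is computed using a discrete Dirichlet energy. Using a combination of theoretical arguments and empirical results, we show that many common training heuristics, such as parameter norm regularization, spectral norm regularization, flatness regularization, implicit gradient regularization, noise regularization, and the choice of parameter initialization, all act to control geometric complexity, providing a unifying framework in which to characterize the behavior of deep learning models.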
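A discrete Dirichlet energy of the kind mentioned above can be sketched as the mean squared gradient norm of the model function over a set of data points. The sketch below is an assumption-laden illustration rather than the paper's implementation: it handles only scalar-valued functions, estimates gradients by central finite differences with a hypothetical step `h`, and uses a toy linear function in place of a network.

```python
def dirichlet_energy(f, points, h=1e-5):
    """Mean squared gradient norm of a scalar function f over `points`,
    with gradients estimated by central finite differences."""
    total = 0.0
    for x in points:
        for i in range(len(x)):
            upper = list(x)
            lower = list(x)
            upper[i] += h
            lower[i] -= h
            partial = (f(upper) - f(lower)) / (2 * h)
            total += partial ** 2
    return total / len(points)

def linear_model(x):
    """Toy stand-in for a trained model: f(x) = 3*x0 - 4*x1."""
    return 3.0 * x[0] - 4.0 * x[1]

points = [[0.0, 0.0], [1.0, 2.0], [-1.0, 0.5]]
energy = dirichlet_energy(linear_model, points)
```

For the linear toy model the gradient is the constant vector (3, -4), so the energy is close to its squared norm, 25, at every point; a function that varies more sharply across the data would score higher.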
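Recently, over-parameterized deep networks, with increasingly more network parameters than training samples, have dominated the performance of modern machine learning. However, it is well known that when the training data is corrupted, over-parameterized networks tend to overfit and do not generalize. In this work, we propose a principled approach for the robust training of over-parameterized deep networks in classification tasks where a proportion of the training labels are corrupted. The main idea is very simple: label noise is sparse and incoherent with the network learned from clean data, so we model the noise and learn to separate it from the data. Specifically, we model the label noise via another sparse over-parameterization term and exploit implicit algorithmic regularization to recover and separate the underlying corruptions. Remarkably, when trained using such a simple method in practice, we demonstrate state-of-the-art test accuracy against label noise on a variety of real datasets. Furthermore, our experimental results are corroborated by theory on simplified linear models, showing that exact separation between sparse noise and low-rank data can be achieved under incoherence conditions. This work opens many interesting directions for improving over-parameterized models by using sparse over-parameterization and implicit regularization.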
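Between 2015 and 2019, members of the Horizon 2020-funded Innovative Training Network named "AMVA4NewPhysics" studied the customization and application of advanced multivariate analysis methods and statistical learning tools to high-energy physics problems, and also developed entirely new ones. Many of these methods were successfully used to improve the sensitivity of data analyses performed by the ATLAS and CMS experiments at the CERN Large Hadron Collider; several others, still in the testing phase, promise to further improve the precision of measurements of fundamental physics parameters and the reach of searches for new phenomena. In this paper, the most relevant new tools among those studied and developed are presented, along with an evaluation of their performance.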
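Partial differential equations (PDEs) are used to model a variety of dynamical systems in science and engineering. Recent advances in deep learning have enabled us to solve them in higher dimensions by addressing the curse of dimensionality in new ways. However, deep learning methods are constrained by training time and memory. To tackle these shortcomings, we implement Tensor Neural Networks (TNN), a quantum-inspired neural network architecture that leverages ideas from tensor networks to improve upon deep learning approaches. We demonstrate that TNN provide significant parameter savings while attaining the same accuracy as the classical Dense Neural Network (DNN). In addition, we show how TNN can be trained faster than DNN for the same accuracy. We benchmark TNN by applying them to solve parabolic PDEs, specifically the Black-Scholes-Barenblatt equation, which is widely used in financial pricing theory. Further examples, such as the Hamilton-Jacobi-Bellman equation, are also discussed.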