Neural networks require careful weight initialization to prevent signals from exploding or vanishing. Existing initialization schemes solve this problem in specific cases by assuming that the network has a certain activation function or topology. It is difficult to derive such weight initialization strategies, and modern architectures therefore often use these same initialization schemes even though their assumptions do not hold. This paper introduces AutoInit, a weight initialization algorithm that automatically adapts to different neural network architectures. By analytically tracking the mean and variance of signals as they propagate through the network, AutoInit appropriately scales the weights at each layer to avoid exploding or vanishing signals. Experiments demonstrate that AutoInit improves performance of convolutional, residual, and transformer networks across a range of activation function, dropout, weight decay, learning rate, and normalizer settings, and does so more reliably than data-dependent initialization methods. This flexibility allows AutoInit to initialize models for everything from small tabular tasks to large datasets such as ImageNet. Such generality turns out particularly useful in neural architecture search and in activation function discovery. In these settings, AutoInit initializes each candidate appropriately, making performance evaluations more accurate. AutoInit thus serves as an automatic configuration tool that makes design of new neural network architectures more robust. The AutoInit package provides a wrapper around TensorFlow models and is available at https://github.com/cognizant-ai-labs/autoinit.
translated by 谷歌翻译
神经架构的创新促进了语言建模和计算机视觉中的重大突破。不幸的是,如果网络参数未正确初始化,新颖的架构通常会导致挑战超参数选择和培训不稳定。已经提出了许多架构特定的初始化方案,但这些方案并不总是可移植到新体系结构。本文介绍了毕业,一种用于初始化神经网络的自动化和架构不可知论由方法。毕业基础是一个简单的启发式;调整每个网络层的规范,使得具有规定的超参数的SGD或ADAM的单个步骤导致可能的损耗值最小。通过在每个参数块前面引入标量乘数变量,然后使用简单的数字方案优化这些变量来完成此调整。 GradInit加速了许多卷积架构的收敛性和测试性能,无论是否有跳过连接,甚至没有归一化层。它还提高了机器翻译的原始变压器架构的稳定性,使得在广泛的学习速率和动量系数下使用ADAM或SGD来训练它而无需学习速率预热。代码可在https://github.com/zhuchen03/gradinit上获得。
translated by 谷歌翻译
The choice of activation functions and their motivation is a long-standing issue within the neural network community. Neuronal representations within artificial neural networks are commonly understood as logits, representing the log-odds score of presence of features within the stimulus. We derive logit-space operators equivalent to probabilistic Boolean logic-gates AND, OR, and XNOR for independent probabilities. Such theories are important to formalize more complex dendritic operations in real neurons, and these operations can be used as activation functions within a neural network, introducing probabilistic Boolean-logic as the core operation of the neural network. Since these functions involve taking multiple exponents and logarithms, they are computationally expensive and not well suited to be directly used within neural networks. Consequently, we construct efficient approximations named $\text{AND}_\text{AIL}$ (the AND operator Approximate for Independent Logits), $\text{OR}_\text{AIL}$, and $\text{XNOR}_\text{AIL}$, which utilize only comparison and addition operations, have well-behaved gradients, and can be deployed as activation functions in neural networks. Like MaxOut, $\text{AND}_\text{AIL}$ and $\text{OR}_\text{AIL}$ are generalizations of ReLU to two-dimensions. While our primary aim is to formalize dendritic computations within a logit-space probabilistic-Boolean framework, we deploy these new activation functions, both in isolation and in conjunction to demonstrate their effectiveness on a variety of tasks including image classification, transfer learning, abstract reasoning, and compositional zero-shot learning.
translated by 谷歌翻译
培训深度神经网络是一项非常苛刻的任务,尤其是具有挑战性的是如何适应体系结构以提高训练有素的模型的性能。我们可以发现,有时,浅网络比深网概括得更好,并且增加更多层会导致更高的培训和测试错误。深层残留学习框架通过将跳过连接添加到几个神经网络层来解决此降解问题。最初,需要这种跳过连接才能成功地训练深层网络,因为网络的表达性会随着深度的指数增长而成功。在本文中,我们首先通过神经网络分析信息流。我们介绍和评估批处理循环,该批处理通过神经网络的每一层量化信息流。我们从经验和理论上证明,基于梯度下降的训练方法需要正面批处理融合,以成功地优化给定的损失功能。基于这些见解,我们引入了批处理凝聚正则化,以使基于梯度下降的训练算法能够单独通过每个隐藏层来优化信息流。借助批处理正则化,梯度下降优化器可以将不可吸引的网络转换为可训练的网络。我们从经验上表明,因此我们可以训练“香草”完全连接的网络和卷积神经网络 - 没有跳过连接,批处理标准化,辍学或任何其他建筑调整 - 只需将批处理 - 凝集正则术语添加到500层中损失功能。批处理 - 注入正则化的效果不仅在香草神经网络上评估,还评估了在各种计算机视觉以及自然语言处理任务上的剩余网络,自动编码器以及变压器模型上。
translated by 谷歌翻译
批量标准化是所有最先进的神经网络架构的必要组件。然而,由于引入了许多实际问题,最近的研究已经致力于设计无规范化的架构。在本文中,我们表明权重初始化是培训Reset的归一化网络的关键。特别是,我们提出了对跳过连接分支的块输出的求和操作的略微修改,从而正确初始化整个网络。我们表明,这种修改的体系结构在CiFar-10上实现了竞争结果,而无需进一步正常化,也不是算法修改。
translated by 谷歌翻译
深度学习文献通过新的架构和培训技术不断更新。然而,尽管有一些关于随机权重的发现,但最近的研究却忽略了重量初始化。另一方面,最近的作品一直在接近网络科学,以了解训练后人工神经网络(ANN)的结构和动态。因此,在这项工作中,我们分析了随机初始化网络中神经元的中心性。我们表明,较高的神经元强度方差可能会降低性能,而较低的神经元强度方差通常会改善它。然后,提出了一种新方法,根据其强度根据优先附着(PA)规则重新连接神经元连接,从而大大降低了通过常见方法初始化的层的强度方差。从这个意义上讲,重新布线仅重新组织连接,同时保留权重的大小和分布。我们通过对图像分类进行的广泛统计分析表明,在使用简单和复杂的体系结构和学习时间表时,在大多数情况下,在培训和测试过程中,性能都会提高。我们的结果表明,除了规模外,权重的组织也与更好的初始化初始化有关。
translated by 谷歌翻译
这本数字本书包含在物理模拟的背景下与深度学习相关的一切实际和全面的一切。尽可能多,所有主题都带有Jupyter笔记本的形式的动手代码示例,以便快速入门。除了标准的受监督学习的数据中,我们将看看物理丢失约束,更紧密耦合的学习算法,具有可微分的模拟,以及加强学习和不确定性建模。我们生活在令人兴奋的时期:这些方法具有从根本上改变计算机模拟可以实现的巨大潜力。
translated by 谷歌翻译
分批归一化(BN)由归一化组成部分,然后是仿射转化,并且对于训练深神经网络至关重要。网络中每个BN的标准初始化分别设置了仿射变换量表,并将其转移到1和0。但是,经过训练,我们观察到这些参数从初始化中并没有太大变化。此外,我们注意到归一化过程仍然可以产生过多的值,这对于训练是不可能的。我们重新审视BN公式,并为BN提出了一种新的初始化方法和更新方法,以解决上述问题。实验旨在强调和证明适当的BN规模初始化对性能的积极影响,并使用严格的统计显着性测试进行评估。该方法可以与现有实施方式一起使用,没有额外的计算成本。源代码可在https://github.com/osu-cvl/revisiting-bninit上获得。
translated by 谷歌翻译
In the past few years, neural architecture search (NAS) has become an increasingly important tool within the deep learning community. Despite the many recent successes of NAS, however, most existing approaches operate within highly structured design spaces, and hence explore only a small fraction of the full search space of neural architectures while also requiring significant manual effort from domain experts. In this work, we develop techniques that enable efficient NAS in a significantly larger design space. To accomplish this, we propose to perform NAS in an abstract search space of program properties. Our key insights are as follows: (1) the abstract search space is significantly smaller than the original search space, and (2) architectures with similar program properties also have similar performance; thus, we can search more efficiently in the abstract search space. To enable this approach, we also propose a novel efficient synthesis procedure, which accepts a set of promising program properties, and returns a satisfying neural architecture. We implement our approach, $\alpha$NAS, within an evolutionary framework, where the mutations are guided by the program properties. Starting with a ResNet-34 model, $\alpha$NAS produces a model with slightly improved accuracy on CIFAR-10 but 96% fewer parameters. On ImageNet, $\alpha$NAS is able to improve over Vision Transformer (30% fewer FLOPS and parameters), ResNet-50 (23% fewer FLOPS, 14% fewer parameters), and EfficientNet (7% fewer FLOPS and parameters) without any degradation in accuracy.
translated by 谷歌翻译
深度学习使用由其重量进行参数化的神经网络。通常通过调谐重量来直接最小化给定损耗功能来训练神经网络。在本文中,我们建议将权重重新参数转化为网络中各个节点的触发强度的目标。给定一组目标,可以计算使得发射强度最佳地满足这些目标的权重。有人认为,通过我们称之为级联解压缩的过程,使用培训的目标解决爆炸梯度的问题,并使损失功能表面更加光滑,因此导致更容易,培训更快,以及潜在的概括,神经网络。它还允许更容易地学习更深层次和经常性的网络结构。目标对重量的必要转换有额外的计算费用,这是在许多情况下可管理的。在目标空间中学习可以与现有的神经网络优化器相结合,以额外收益。实验结果表明了使用目标空间的速度,以及改进的泛化的示例,用于全连接的网络和卷积网络,以及调用和处理长时间序列的能力,并使用经常性网络进行自然语言处理。
translated by 谷歌翻译
本文提出了一种新的和富有激光激活方法,被称为FPLUS,其利用具有形式的极性标志的数学功率函数。它是通过常见的逆转操作来启发,同时赋予仿生学的直观含义。制剂在某些先前知识和预期特性的条件下理论上得出,然后通过使用典型的基准数据集通过一系列实验验证其可行性,其结果表明我们的方法在许多激活功能中拥有卓越的竞争力,以及兼容稳定性许多CNN架构。此外,我们将呈现给更广泛类型的功能延伸到称为PFPlus的函数,具有两个可以固定的或学习的参数,以便增加其表现力的容量,并且相同的测试结果验证了这种改进。
translated by 谷歌翻译
Alphazero,Leela Chess Zero和Stockfish Nnue革新了计算机国际象棋。本书对此类引擎的技术内部工作进行了完整的介绍。该书分为四个主要章节 - 不包括第1章(简介)和第6章(结论):第2章引入神经网络,涵盖了所有用于构建深层网络的基本构建块,例如Alphazero使用的网络。内容包括感知器,后传播和梯度下降,分类,回归,多层感知器,矢量化技术,卷积网络,挤压网络,挤压和激发网络,完全连接的网络,批处理归一化和横向归一化和跨性线性单位,残留层,剩余层,过度效果和底漆。第3章介绍了用于国际象棋发动机以及Alphazero使用的经典搜索技术。内容包括minimax,alpha-beta搜索和蒙特卡洛树搜索。第4章展示了现代国际象棋发动机的设计。除了开创性的Alphago,Alphago Zero和Alphazero我们涵盖Leela Chess Zero,Fat Fritz,Fat Fritz 2以及有效更新的神经网络(NNUE)以及MAIA。第5章是关于实施微型α。 Shexapawn是国际象棋的简约版本,被用作为此的示例。 Minimax搜索可以解决六ap峰,并产生了监督学习的培训位置。然后,作为比较,实施了类似Alphazero的训练回路,其中通过自我游戏进行训练与强化学习结合在一起。最后,比较了类似α的培训和监督培训。
translated by 谷歌翻译
The automated machine learning (AutoML) field has become increasingly relevant in recent years. These algorithms can develop models without the need for expert knowledge, facilitating the application of machine learning techniques in the industry. Neural Architecture Search (NAS) exploits deep learning techniques to autonomously produce neural network architectures whose results rival the state-of-the-art models hand-crafted by AI experts. However, this approach requires significant computational resources and hardware investments, making it less appealing for real-usage applications. This article presents the third version of Pareto-Optimal Progressive Neural Architecture Search (POPNASv3), a new sequential model-based optimization NAS algorithm targeting different hardware environments and multiple classification tasks. Our method is able to find competitive architectures within large search spaces, while keeping a flexible structure and data processing pipeline to adapt to different tasks. The algorithm employs Pareto optimality to reduce the number of architectures sampled during the search, drastically improving the time efficiency without loss in accuracy. The experiments performed on images and time series classification datasets provide evidence that POPNASv3 can explore a large set of assorted operators and converge to optimal architectures suited for the type of data provided under different scenarios.
translated by 谷歌翻译
While machine learning is traditionally a resource intensive task, embedded systems, autonomous navigation, and the vision of the Internet of Things fuel the interest in resource-efficient approaches. These approaches aim for a carefully chosen trade-off between performance and resource consumption in terms of computation and energy. The development of such approaches is among the major challenges in current machine learning research and key to ensure a smooth transition of machine learning technology from a scientific environment with virtually unlimited computing resources into everyday's applications. In this article, we provide an overview of the current state of the art of machine learning techniques facilitating these real-world requirements. In particular, we focus on deep neural networks (DNNs), the predominant machine learning models of the past decade. We give a comprehensive overview of the vast literature that can be mainly split into three non-mutually exclusive categories: (i) quantized neural networks, (ii) network pruning, and (iii) structural efficiency. These techniques can be applied during training or as post-processing, and they are widely used to reduce the computational demands in terms of memory footprint, inference speed, and energy efficiency. We also briefly discuss different concepts of embedded hardware for DNNs and their compatibility with machine learning techniques as well as potential for energy and latency reduction. We substantiate our discussion with experiments on well-known benchmark datasets using compression techniques (quantization, pruning) for a set of resource-constrained embedded systems, such as CPUs, GPUs and FPGAs. The obtained results highlight the difficulty of finding good trade-offs between resource efficiency and predictive performance.
translated by 谷歌翻译
An activation function has a significant impact on the efficiency and robustness of the neural networks. As an alternative, we evolved a cutting-edge non-monotonic activation function, Negative Stimulated Hybrid Activation Function (Nish). It acts as a Rectified Linear Unit (ReLU) function for the positive region and a sinus-sigmoidal function for the negative region. In other words, it incorporates a sigmoid and a sine function and gaining new dynamics over classical ReLU. We analyzed the consistency of the Nish for different combinations of essential networks and most common activation functions using on several most popular benchmarks. From the experimental results, we reported that the accuracy rates achieved by the Nish is slightly better than compared to the Mish in classification.
translated by 谷歌翻译
Generative Adversarial Networks (GANs) were introduced by Goodfellow in 2014, and since then have become popular for constructing generative artificial intelligence models. However, the drawbacks of such networks are numerous, like their longer training times, their sensitivity to hyperparameter tuning, several types of loss and optimization functions and other difficulties like mode collapse. Current applications of GANs include generating photo-realistic human faces, animals and objects. However, I wanted to explore the artistic ability of GANs in more detail, by using existing models and learning from them. This dissertation covers the basics of neural networks and works its way up to the particular aspects of GANs, together with experimentation and modification of existing available models, from least complex to most. The intention is to see if state of the art GANs (specifically StyleGAN2) can generate album art covers and if it is possible to tailor them by genre. This was attempted by first familiarizing myself with 3 existing GANs architectures, including the state of the art StyleGAN2. The StyleGAN2 code was used to train a model with a dataset containing 80K album cover images, then used to style images by picking curated images and mixing their styles.
translated by 谷歌翻译
Many applications require sparse neural networks due to space or inference time restrictions. There is a large body of work on training dense networks to yield sparse networks for inference, but this limits the size of the largest trainable sparse model to that of the largest trainable dense model. In this paper we introduce a method to train sparse neural networks with a fixed parameter count and a fixed computational cost throughout training, without sacrificing accuracy relative to existing dense-tosparse training methods. Our method updates the topology of the sparse network during training by using parameter magnitudes and infrequent gradient calculations. We show that this approach requires fewer floating-point operations (FLOPs) to achieve a given level of accuracy compared to prior techniques. We demonstrate state-of-the-art sparse training results on a variety of networks and datasets, including ResNet-50, MobileNets on Imagenet-2012, and RNNs on WikiText-103. Finally, we provide some insights into why allowing the topology to change during the optimization can overcome local minima encountered when the topology remains static * .
translated by 谷歌翻译
This paper addresses the scalability challenge of architecture search by formulating the task in a differentiable manner. Unlike conventional approaches of applying evolution or reinforcement learning over a discrete and non-differentiable search space, our method is based on the continuous relaxation of the architecture representation, allowing efficient search of the architecture using gradient descent. Extensive experiments on CIFAR-10, ImageNet, Penn Treebank and WikiText-2 show that our algorithm excels in discovering high-performance convolutional architectures for image classification and recurrent architectures for language modeling, while being orders of magnitude faster than state-of-the-art non-differentiable techniques. Our implementation has been made publicly available to facilitate further research on efficient architecture search algorithms.
translated by 谷歌翻译
Neural networks have proven effective at solving difficult problems but designing their architectures can be challenging, even for image classification problems alone. Our goal is to minimize human participation, so we employ evolutionary algorithms to discover such networks automatically. Despite significant computational requirements, we show that it is now possible to evolve models with accuracies within the range of those published in the last year. Specifically, we employ simple evolutionary techniques at unprecedented scales to discover models for the CIFAR-10 and CIFAR-100 datasets, starting from trivial initial conditions and reaching accuracies of 94.6% (95.6% for ensemble) and 77.0%, respectively. To do this, we use novel and intuitive mutation operators that navigate large search spaces; we stress that no human participation is required once evolution starts and that the output is a fully-trained model. Throughout this work, we place special emphasis on the repeatability of results, the variability in the outcomes and the computational requirements.
translated by 谷歌翻译
这是一门专门针对STEM学生开发的介绍性机器学习课程。我们的目标是为有兴趣的读者提供基础知识,以在自己的项目中使用机器学习,并将自己熟悉术语作为进一步阅读相关文献的基础。在这些讲义中,我们讨论受监督,无监督和强化学习。注释从没有神经网络的机器学习方法的说明开始,例如原理分析,T-SNE,聚类以及线性回归和线性分类器。我们继续介绍基本和先进的神经网络结构,例如密集的进料和常规神经网络,经常性的神经网络,受限的玻尔兹曼机器,(变性)自动编码器,生成的对抗性网络。讨论了潜在空间表示的解释性问题,并使用梦和对抗性攻击的例子。最后一部分致力于加强学习,我们在其中介绍了价值功能和政策学习的基本概念。
translated by 谷歌翻译