Common to all kinds of recurrent neural networks (RNNs) is the intention to model relations between data points through time. When there is no immediate relationship between subsequent data points (for example, when the data points are generated at random), we show that RNNs are still able to remember a few data points back into the sequence by memorizing them by heart using standard backpropagation. However, we also show that for classical RNNs, LSTM and GRU networks, the distance between data points across recurrent calls that can be reproduced this way is highly limited (compared to even a loose connection between data points) and subject to various constraints imposed by the type and size of the RNN in question. This implies the existence of a hard limit (way below the information-theoretic one) for the distance between related data points within which RNNs are still able to recognize said relation.
In this paper we compare different types of recurrent units in recurrent neural networks (RNNs). In particular, we focus on more sophisticated units that implement a gating mechanism, such as the long short-term memory (LSTM) unit and the recently proposed gated recurrent unit (GRU). We evaluate these recurrent units on the tasks of polyphonic music modeling and speech signal modeling. Our experiments revealed that these advanced recurrent units are indeed better than more traditional recurrent units such as tanh units. We also found the GRU to be comparable to the LSTM.
In this study, we investigate the generalization of LSTM, ReLU and GRU models on counting tasks over long sequences. Previous theoretical work has established that RNNs with ReLU activation and LSTMs have the capacity for counting with suitable configuration, while GRUs have limitations that prevent correct counting over longer sequences. Despite this and some positive empirical results for LSTMs on Dyck-1 languages, our experimental results show that LSTMs fail to learn correct counting behavior for sequences that are significantly longer than in the training data. ReLUs show much larger variance in behavior and in most cases worse generalization. The long sequence generalization is empirically related to validation loss, but reliable long sequence generalization seems not practically achievable through backpropagation with current techniques. We demonstrate different failure modes for LSTMs, GRUs and ReLUs. In particular, we observe that the saturation of activation functions in LSTMs and the correct weight setting for ReLUs to generalize counting behavior are not achieved in standard training regimens. In summary, learning generalizable counting behavior is still an open problem and we discuss potential approaches for further research.
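The claim that ReLU units have the capacity to count "with suitable configuration" can be illustrated with a small hand-set recurrence. This is only a sketch under stated assumptions, not the construction or the trained models from the study: the function name `dyck1_relu_counter`, the two state variables, and the weights (all fixed to +1 or -1) are hypothetical, and the input is assumed to contain only `(` and `)`.

```python
def relu(x):
    """Rectified linear unit on a scalar."""
    return max(0.0, x)


def dyck1_relu_counter(s):
    """Accept balanced bracket strings using only ReLU units with hand-set weights.

    Assumption: `s` contains only '(' and ')'. Nothing here is learned.
    """
    depth = 0.0      # number of currently open brackets
    violation = 0.0  # becomes positive once a ')' arrives with nothing open
    for ch in s:
        step = 1.0 if ch == "(" else -1.0
        # A close at depth 0 makes -(depth + step) positive; the flag never resets.
        violation = relu(violation + relu(-(depth + step)))
        depth = relu(depth + step)
    return depth == 0.0 and violation == 0.0


if __name__ == "__main__":
    for example in ["", "()", "(()())", ")(", "(()"]:
        print(repr(example), dyck1_relu_counter(example))
```

Because the weights are exact, this counter behaves the same at any sequence length. The abstract's point is that standard training does not reliably arrive at such a weight setting.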
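In recent years, the use of orthogonal matrices has been shown to be a promising way to improve the training, stability, and convergence of recurrent neural networks (RNNs), particularly by controlling gradients. Gated recurrent unit (GRU) and long short-term memory (LSTM) architectures address the vanishing gradient problem with a variety of gates and memory cells, but they are still prone to the exploding gradient problem. In this work, we analyze the gradients in the GRU and propose the use of orthogonal matrices to prevent exploding gradients and enhance long-term memory. We study where to use orthogonal matrices and propose a Neumann series-based scaled Cayley transformation for training orthogonal matrices in the GRU, which we call the Neumann-Cayley Orthogonal GRU, or simply NC-GRU. We present detailed experiments with our model on several synthetic and real-world tasks, which show that NC-GRU significantly outperforms the GRU as well as several other RNNs.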
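A scaled Cayley transform combined with a Neumann series can be sketched in a few lines. The sketch assumes the common form $W = (I + A)^{-1}(I - A)D$, with $A$ skew-symmetric and $D$ diagonal with entries $\pm 1$, and approximates the inverse by the truncated series $\sum_k (-A)^k$, which converges only when the spectral radius of $A$ is below 1. The helper names, the example matrices, and the number of series terms are illustrative assumptions; the abstract does not specify how NC-GRU applies the transform inside the GRU.

```python
def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]


def identity(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]


def add(X, Y, sign=1.0):
    """Return X + sign * Y element-wise."""
    return [[x + sign * y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]


def transpose(X):
    return [list(col) for col in zip(*X)]


def neumann_inverse(A, terms):
    """Approximate (I + A)^{-1} by sum_{k < terms} (-A)^k."""
    neg_A = [[-a for a in row] for row in A]
    total = identity(len(A))
    power = identity(len(A))
    for _ in range(1, terms):
        power = matmul(power, neg_A)
        total = add(total, power)
    return total


def scaled_cayley(A, D, terms=60):
    """Return (I + A)^{-1} (I - A) D, using the Neumann approximation."""
    inv = neumann_inverse(A, terms)
    return matmul(matmul(inv, add(identity(len(A)), A, sign=-1.0)), D)


# Hypothetical example: a small skew-symmetric A (spectral radius about 0.37).
A = [[0.0, 0.3, -0.1],
     [-0.3, 0.0, 0.2],
     [0.1, -0.2, 0.0]]
D = [[1.0, 0.0, 0.0],
     [0.0, -1.0, 0.0],
     [0.0, 0.0, 1.0]]
W = scaled_cayley(A, D)
WtW = matmul(transpose(W), W)
# Largest deviation of W^T W from the identity; small when W is orthogonal.
max_err = max(abs(WtW[i][j] - (1.0 if i == j else 0.0))
              for i in range(3) for j in range(3))
```

The series avoids an explicit matrix inversion, at the cost of requiring $A$ to stay small enough for the series to converge.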
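A common approach to prediction and planning in partially observable domains is to use recurrent neural networks (RNNs), which ideally develop and maintain a latent memory of hidden, task-relevant factors. We hypothesize that many of these hidden factors in the physical world are constant over time and change only sparsely. To study this hypothesis, we propose Gated $L_0$ Regularized Dynamics (GateL0RD), a novel recurrent architecture that incorporates the inductive bias to maintain stable, sparsely changing latent states. The bias is implemented by means of a novel internal gating function and a penalty on the $L_0$ norm of latent state changes. We demonstrate that GateL0RD can compete with or outperform state-of-the-art RNNs in a variety of partially observable prediction and control tasks. GateL0RD tends to encode the underlying generative factors of the environment, ignores spurious temporal dependencies, and generalizes better, improving sampling efficiency and overall performance in model-based planning and reinforcement learning tasks. Moreover, we show that the developing latent states can be easily interpreted, which is a step towards better explainability in RNNs.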
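The idea of a gate that leaves most latent dimensions untouched, together with an $L_0$ penalty on state changes, can be sketched as follows. This is a minimal illustration and not the GateL0RD implementation: the rectified-tanh gate `retanh`, the convex-combination update, and all names and numbers are assumptions made for the example. The $L_0$ count is not differentiable, so a real training setup would need a surrogate that this sketch leaves out.

```python
import math


def retanh(x):
    """Rectified tanh: exactly 0 for x <= 0, so a gate can stay fully closed."""
    return max(0.0, math.tanh(x))


def gated_update(h_prev, candidate, gate_pre):
    """Blend the old state with a candidate, one gate per latent dimension.

    Returns the new state and the L0 norm of the state change, i.e. the
    number of latent dimensions whose value actually changed.
    """
    gates = [retanh(s) for s in gate_pre]
    h_new = [g * c + (1.0 - g) * h
             for g, c, h in zip(gates, candidate, h_prev)]
    l0_change = sum(1 for new, old in zip(h_new, h_prev) if new != old)
    return h_new, l0_change


# Hypothetical values: only the second gate opens.
h_prev = [0.5, -0.2, 0.9]
candidate = [0.1, 0.4, -0.3]
gate_pre = [-1.0, 2.0, -0.5]
h_new, l0_change = gated_update(h_prev, candidate, gate_pre)
```

With a closed gate the old value is copied through unchanged, which is what allows a latent state to stay piecewise constant over time.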
Learning to store information over extended time intervals via recurrent backpropagation takes a very long time, mostly due to insufficient, decaying error backflow. We briefly review Hochreiter's 1991 analysis of this problem, then address it by introducing a novel, efficient, gradient-based method called "Long Short-Term Memory" (LSTM). Truncating the gradient where this does not do harm, LSTM can learn to bridge minimal time lags in excess of 1000 discrete time steps by enforcing constant error flow through "constant error carrousels" within special units. Multiplicative gate units learn to open and close access to the constant error flow. LSTM is local in space and time; its computational complexity per time step and weight is $O(1)$. Our experiments with artificial data involve local, distributed, real-valued, and noisy pattern representations. In comparisons with RTRL, BPTT, Recurrent Cascade-Correlation, Elman nets, and Neural Sequence Chunking, LSTM leads to many more successful runs, and learns much faster. LSTM also solves complex, artificial long time lag tasks that have never been solved by previous recurrent network [...] solve long time lag problems. (2) It has fully connected second-order sigma-pi units, while the LSTM architecture's MUs are used only to gate access to constant error flow. (3) Watrous and Kuhn's algorithm costs $O(W^2)$ operations per time step, ours only $O(W)$, where $W$ is the number of weights. See also Miller and Giles (1993) for additional work on MUs.
Simple weight guessing. To avoid long time lag problems of gradient-based approaches we may simply randomly initialize all network weights until the resulting net happens to classify all training sequences correctly. In fact, recently we discovered (Schmidhuber and Hochreiter 1996, 1997) that simple weight guessing solves many of the problems in [...], Miller and Giles 1993, Lin et al. 1995 faster than the algorithms proposed therein. This does not mean that weight guessing is a good algorithm. It just means that the problems are very simple. More realistic tasks require either many free parameters (e.g., input weights) or high weight precision (e.g., for continuous-valued parameters), such that guessing becomes completely infeasible.
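The difference between decaying error backflow and the constant error flow mentioned in the abstract above can be shown with a scalar sketch. For a single self-connected unit, the error flowing back one time step is scaled by the product of the activation derivative and the self-connection weight. The function name `backflow` and the scale values 0.9 and 1.1 are illustrative assumptions; a scale of exactly 1.0 corresponds to a self-connection of weight 1.0 with an identity activation.

```python
def backflow(scale_per_step, steps):
    """Magnitude of a unit error signal after `steps` backward time steps,
    when each step multiplies it by `scale_per_step`."""
    error = 1.0
    for _ in range(steps):
        error *= scale_per_step
    return error


decaying = backflow(0.9, 1000)   # scale below 1: the error vanishes
exploding = backflow(1.1, 1000)  # scale above 1: the error blows up
carousel = backflow(1.0, 1000)   # scale exactly 1: the error is preserved

if __name__ == "__main__":
    print(decaying, exploding, carousel)
```

Only the third case lets an error signal survive a lag of 1000 steps unchanged, which is the property the multiplicative gates are then meant to protect.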
Recent developments in quantum computing and machine learning have propelled the interdisciplinary study of quantum machine learning. Sequential modeling is an important task with high scientific and commercial value. Existing VQC or QNN-based methods require significant computational resources to perform the gradient-based optimization of a large number of quantum circuit parameters. The major drawback is that such quantum gradient calculation requires a large number of circuit evaluations, posing challenges for current near-term quantum hardware and simulation software. In this work, we approach sequential modeling by applying a reservoir computing (RC) framework to quantum recurrent neural networks (QRNN-RC) that are based on classical RNN, LSTM and GRU. The main idea of this RC approach is that the QRNN with randomly initialized weights is treated as a dynamical system and only the final classical linear layer is trained. Our numerical simulations show that the QRNN-RC can reach results comparable to fully trained QRNN models for several function approximation and time series prediction tasks. Since the QRNN training complexity is significantly reduced, the proposed model trains notably faster. In this work we also compare to corresponding classical RNN-based RC implementations and show that the quantum version learns faster by requiring fewer training epochs in most cases. Our results demonstrate a new possibility to utilize quantum neural networks for sequential modeling with greater quantum hardware efficiency, an important design consideration for noisy intermediate-scale quantum (NISQ) computers.
We explore relations between the hyper-parameters of a recurrent neural network (RNN) and the complexity of string sequences it is able to memorize. We compare long short-term memory (LSTM) networks and gated recurrent units (GRUs). We find that an increase of RNN depth does not necessarily result in better memorization capability when the training time is constrained. Our results also indicate that the learning rate and the number of units per layer are among the most important hyper-parameters to be tuned. Generally, GRUs outperform LSTM networks on low complexity sequences while on high complexity sequences LSTMs perform better.
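Due to the success of deep learning (DL) and its growing job market, students and researchers from many areas are interested in learning about DL technologies. Visualization has proven to be of great help during this learning process. While most current educational visualizations target one specific architecture or use case, recurrent neural networks (RNNs), which are capable of processing sequential data, are not yet covered. This is despite the fact that tasks on sequential data, such as text and function analysis, are at the forefront of DL research. We therefore propose exploRNN, the first interactively explorable educational visualization for RNNs. On the basis of making learning easier and more fun, we define educational objectives targeted at understanding RNNs. We use these objectives to form guidelines for the visual design process. By means of exploRNN, which is accessible online, we provide an overview of the training process of RNNs at a coarse level, while also allowing a detailed inspection of the data flow within LSTM cells. In an empirical study, we assessed 37 subjects in a between-subjects design to investigate the learning outcomes and cognitive load of exploRNN compared to a classic text-based learning environment. While learners in the text group are ahead in superficial knowledge acquisition, exploRNN is particularly helpful for a deeper understanding of the learning content. In addition, the complex content in exploRNN is perceived as significantly easier and causes less extraneous load than in the text group. The study shows that for difficult learning material such as recurrent networks, where deep understanding is important, interactive visualizations such as exploRNN can be helpful.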
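Short-term plasticity (STP) is a mechanism that stores decaying memories in the synapses of the cerebral cortex. In computing practice, STP has been used, but mostly in the niche of spiking neurons, even though theory predicts that it is the optimal solution to certain dynamic tasks. Here we present a new type of recurrent neural unit, the STP Neuron (STPN), which indeed turns out to be strikingly powerful. Its key mechanism is that synapses have a state, propagated through time by a self-recurrent connection within the synapse. This formulation enables training the plasticity with backpropagation through time, resulting in a form of learning to learn and forget in the short term. The STPN outperforms all tested alternatives, i.e., RNNs, LSTMs, and other models with fast weights and differentiable plasticity. We confirm this in both supervised and reinforcement learning (RL), and in tasks such as associative retrieval, maze exploration, Atari video games, and MuJoCo robotics. Moreover, we calculate that, in neuromorphic or biological circuits, the STPN minimizes energy consumption across models, as it depresses individual synapses dynamically. Based on these results, biological STP may have been a strong evolutionary attractor that maximizes both efficiency and computational power. The STPN now brings these neuromorphic advantages to a broad spectrum of machine learning practice. Code is available at [...]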
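In this paper, we provide a systematic approach for assessing and comparing the computational complexity of neural network layers in digital signal processing. We provide and link four software-to-hardware complexity measures, defining how the different complexity metrics relate to the layers' hyper-parameters. The paper explains how to compute these four metrics for feed-forward and recurrent layers, and defines in which case a particular metric should be used, depending on whether a more software-oriented or hardware-oriented application is being characterized. One of the four metrics, called "the number of additions and bit shifts (NABS)", is newly introduced for heterogeneous quantization. NABS characterizes the impact not only of the bit width used in an operation but also of the type of quantization used in the arithmetic operations. We intend this work to serve as a baseline for the different levels (purposes) of complexity estimation related to the application of neural networks in real-time digital signal processing, aiming to unify computational complexity estimation.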
Recurrent neural networks are a widely used class of neural architectures. They have, however, two shortcomings. First, they are often treated as black-box models and as such it is difficult to understand what exactly they learn as well as how they arrive at a particular prediction. Second, they tend to work poorly on sequences requiring long-term memorization, despite having this capacity in principle. We aim to address both shortcomings with a class of recurrent networks that use a stochastic state transition mechanism between cell applications. This mechanism, which we term state-regularization, makes RNNs transition between a finite set of learnable states. We evaluate state-regularized RNNs on (1) regular languages for the purpose of automata extraction; (2) non-regular languages such as balanced parentheses and palindromes where external memory is required; and (3) real-world sequence learning tasks for sentiment analysis, visual object recognition and text categorisation. We show that state-regularization (a) simplifies the extraction of finite state automata that display an RNN's state transition dynamic; (b) forces RNNs to operate more like automata with external memory and less like finite state machines, which potentially leads to a more structural memory; (c) leads to better interpretability and explainability of RNNs by leveraging the probabilistic finite state transition mechanism over time steps.
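A transition between a finite set of learnable states can be sketched as a softmax over similarity scores between the cell output and a small table of state vectors. This is an illustrative sketch only: the dot-product score, the temperature parameter, the choice between sampling and mixing, and the names `state_regularize` and `centroids` are assumptions, not details given in the abstract.

```python
import math
import random


def softmax(scores, temperature=1.0):
    """Numerically stable softmax with a temperature."""
    scaled = [s / temperature for s in scores]
    top = max(scaled)
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def state_regularize(u, centroids, temperature=1.0, rng=None):
    """Map a cell output `u` onto a finite set of state vectors.

    Without `rng`, return the probability-weighted mixture of the states.
    With `rng`, sample a single state according to the probabilities.
    """
    scores = [sum(a * b for a, b in zip(u, c)) for c in centroids]
    probs = softmax(scores, temperature)
    if rng is None:
        mixed = [sum(p * c[d] for p, c in zip(probs, centroids))
                 for d in range(len(u))]
        return mixed, probs
    index = rng.choices(range(len(centroids)), weights=probs, k=1)[0]
    return list(centroids[index]), probs


# Hypothetical state table with three states in two dimensions.
centroids = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
u = [0.9, 0.1]
mixed, probs = state_regularize(u, centroids, temperature=0.01)
sampled, _ = state_regularize(u, centroids, temperature=1.0,
                              rng=random.Random(0))
```

At a low temperature the mixture collapses onto the best-matching state, so the network effectively moves between discrete states that can then be read off as an automaton.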
We introduce organism networks, which function like a single neural network but are composed of several neural particle networks; while each particle network fulfils the role of a single weight application within the organism network, it is also trained to self-replicate its own weights. As organism networks feature vastly more parameters than simpler architectures, we perform our initial experiments on an arithmetic task as well as on simplified MNIST-dataset classification as a collective. We observe that individual particle networks tend to specialise in either of the tasks and that the ones fully specialised in the secondary task may be dropped from the network without hindering the computational accuracy of the primary task. This leads to the discovery of a novel pruning strategy for sparse neural networks.
Echo State Networks (ESNs) are a type of recurrent neural network that yields promising results in representing time series and nonlinear dynamic systems. Although they are equipped with a very efficient training procedure, Reservoir Computing strategies, such as the ESN, require the use of high order networks, i.e. large number of layers, resulting in a number of states that is orders of magnitude higher than the number of model inputs and outputs. This not only makes the computation of a time step more costly, but may also pose robustness issues when applying ESNs to problems such as Model Predictive Control (MPC) and other optimal control problems. One way to circumvent this is through Model Order Reduction strategies such as the Proper Orthogonal Decomposition (POD) and its variants (POD-DEIM), whereby we find an equivalent lower order representation of an already trained high dimension ESN. The objective of this work is to investigate and analyze the performance of POD methods in Echo State Networks, evaluating their effectiveness. To this end, we evaluate the Memory Capacity (MC) of the POD-reduced network in comparison to the original (full order) ESN. We also perform experiments on two different numerical case studies: a NARMA10 difference equation and an oil platform containing two wells and one riser. The results show that there is little loss of performance when comparing the original ESN to a POD-reduced counterpart, and also that the performance of a POD-reduced ESN tends to be superior to a normal ESN of the same size. We also attain speedups of around $80\%$ in comparison to the original ESN.