Transformers have been essential to pretraining success in NLP. Other architectures have been used, but require attention layers to match benchmark accuracy. This work explores pretraining without attention. We test recently developed routing layers based on state-space models (SSM) and model architectures based on multiplicative gating. Used together these modeling choices have a large impact on pretraining accuracy. Empirically the proposed Bidirectional Gated SSM (BiGS) replicates BERT pretraining results without attention and can be extended to long-form pretraining of 4096 tokens without approximation.
Finetuning language models on a collection of datasets phrased as instructions has been shown to improve model performance and generalization to unseen tasks. In this paper we explore instruction finetuning with a particular focus on (1) scaling the number of tasks, (2) scaling the model size, and (3) finetuning on chain-of-thought data. We find that instruction finetuning with the above aspects dramatically improves performance on a variety of model classes (PaLM, T5, U-PaLM), prompting setups (zero-shot, few-shot, CoT), and evaluation benchmarks (MMLU, BBH, TyDiQA, MGSM, open-ended generation). For instance, Flan-PaLM 540B instruction-finetuned on 1.8K tasks outperforms PaLM 540B by a large margin (+9.4% on average). Flan-PaLM 540B achieves state-of-the-art performance on several benchmarks, such as 75.2% on five-shot MMLU. We also publicly release Flan-T5 checkpoints, which achieve strong few-shot performance even compared to much larger models, such as PaLM 62B. Overall, instruction finetuning is a general method for improving the performance and usability of pretrained language models.
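Linear time-invariant state space models (SSMs) are a classical model from engineering and statistics that have recently been shown to be very promising in machine learning through the Structured State Space sequence model (S4). A core component of S4 involves initializing the SSM state matrix to a particular matrix called a HiPPO matrix, which was empirically important for S4's ability to handle long sequences. However, the specific matrix that S4 uses was actually derived for a particular time-varying dynamical system, and the use of this matrix as a time-invariant SSM had no known mathematical interpretation. Consequently, the theoretical mechanism by which S4 models long-range dependencies remains unexplained. We derive a more general and intuitive formulation of the HiPPO framework, which provides a simple mathematical interpretation of S4 as a decomposition onto exponentially-warped Legendre polynomials, explaining its ability to capture long dependencies. Our generalization introduces a theoretically rich class of SSMs that also lets us derive more intuitive S4 variants for other bases such as the Fourier basis, and explains other aspects of training S4, such as how to initialize the important timescale parameter. These insights improve S4's performance to 86% on the Long Range Arena benchmark, with 96% on the most difficult Path-X task.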
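State space models (SSMs) have recently been shown to be very effective as a deep learning layer and a promising alternative to sequence models such as RNNs, CNNs, or Transformers. The first version to show this potential was the S4 model, which is particularly effective on tasks involving long-range dependencies by using a prescribed state matrix called the HiPPO matrix. While this has an interpretable mathematical mechanism for modeling long dependencies, it introduces a custom representation and algorithm that can be difficult to implement. On the other hand, a recent variant of S4 called DSS showed that restricting the state matrix to be fully diagonal can still preserve the performance of the original model when using a specific initialization based on approximating S4's matrix. This work seeks to systematically understand how to parameterize and initialize such diagonal state space models. While it follows from classical results that almost all SSMs have an equivalent diagonal form, we show that the initialization is critical for performance. We explain why DSS works mathematically by showing that the diagonal restriction of S4's matrix surprisingly recovers the same kernel in the limit of infinite state dimension. We also systematically describe various design choices in parameterizing and computing diagonal SSMs, and perform a controlled empirical study ablating the effects of these choices. Our final model, S4D, is a simple diagonal version of S4 whose kernel computation requires just 2 lines of code and performs comparably to S4 in almost all settings, with state-of-the-art results for image, audio, and medical time-series domains, and an average of 85% on the Long Range Arena benchmark.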
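To make the idea of a diagonal SSM kernel concrete, here is a minimal NumPy sketch. It is not the authors' code. It assumes a zero-order-hold discretization with step `dt` and takes the real part of the complex kernel, and the function name `diagonal_ssm_kernel` and all parameter names are hypothetical. Because the state matrix is diagonal, the kernel \( K_\ell = \sum_n C_n \bar{B}_n \bar{A}_n^\ell \) reduces to a product with a Vandermonde matrix.

```python
import numpy as np

def diagonal_ssm_kernel(A_diag, B, C, dt, L):
    """Length-L convolution kernel of an SSM whose state matrix is diagonal.

    A_diag, B, C are complex vectors of length N (A_diag holds the diagonal).
    Assumes zero-order-hold discretization; returns the real part of the kernel.
    """
    A_bar = np.exp(dt * A_diag)                   # elementwise exp, valid because A is diagonal
    B_bar = (A_bar - 1.0) / A_diag * B            # ZOH input matrix, also elementwise
    vandermonde = A_bar[:, None] ** np.arange(L)  # shape (N, L): powers A_bar_n ** l
    return ((C * B_bar) @ vandermonde).real       # K_l = Re(sum_n C_n * B_bar_n * A_bar_n ** l)

# Illustrative values only (not taken from the abstract).
rng = np.random.default_rng(0)
N, L, dt = 8, 16, 0.05
A_diag = -0.5 + 1j * np.pi * np.arange(N)
B = np.ones(N, dtype=complex)
C = rng.standard_normal(N) + 1j * rng.standard_normal(N)
K = diagonal_ssm_kernel(A_diag, B, C, dt, L)
```

The resulting `K` can be convolved with an input sequence of length `L`. With a dense state matrix, the same kernel would require repeated matrix powers rather than elementwise ones.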
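The use of Convolutional Neural Networks (CNNs) is widespread in deep learning due to a range of desirable model properties that result in an efficient and effective machine learning framework. However, CNN architectures must be tailored to specific tasks in order to incorporate considerations such as the input length, resolution, and dimensionality. In this work, we overcome the need for problem-specific CNN architectures with our Continuous Convolutional Neural Network (CCNN): a single CNN architecture equipped with continuous convolutional kernels that can be used for tasks on data of arbitrary resolution, dimensionality, and length without structural changes. Continuous convolutional kernels model long-range dependencies at every layer and remove the need for the downsampling layers and task-dependent depths required in current CNN architectures. We show the generality of our approach by applying the same CCNN to a wide set of tasks on sequential (1$\mathrm{D}$) and visual (2$\mathrm{D}$) data. Our CCNN performs competitively and often outperforms the current state of the art across all tasks considered.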
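A central goal of sequence modeling is designing a single principled model that can address sequence data across a range of modalities and tasks, particularly on long-range dependencies. Although conventional models including RNNs, CNNs, and Transformers have specialized variants for capturing long dependencies, they still struggle to scale to very long sequences of $10000$ or more steps. A recent approach proposed modeling sequences by simulating the fundamental state space model (SSM) \( x'(t) = Ax(t) + Bu(t), y(t) = Cx(t) + Du(t) \), and showed that for appropriate choices of the state matrix \( A \), this system could handle long-range dependencies mathematically and empirically. However, this method has prohibitive computation and memory requirements, rendering it infeasible as a general sequence modeling solution. We propose the Structured State Space sequence model (S4) based on a new parameterization for the SSM, and show that it can be computed much more efficiently than prior approaches while preserving their theoretical strengths. Our technique involves conditioning \( A \) with a low-rank correction, allowing it to be diagonalized stably and reducing the SSM to the well-studied computation of a Cauchy kernel. S4 achieves strong empirical results across a diverse range of established benchmarks, including (i) 91% accuracy on sequential CIFAR-10 with no data augmentation or auxiliary losses, on par with a larger 2-D ResNet, (ii) substantially closing the gap to Transformers on image and language modeling tasks, while performing generation $60\times$ faster, and (iii) SoTA on every task from the Long Range Arena benchmark, including solving the challenging Path-X task of length 16k that all prior work fails on, while being as efficient as all competitors.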
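The state space model above can be simulated directly once it is discretized. The following sketch is the naive approach, not S4's Cauchy-kernel algorithm. It assumes a bilinear discretization with step `dt`, a single input and output channel, and hypothetical function names. It shows that stepping the recurrence and convolving with the kernel \( K_\ell = C \bar{A}^\ell \bar{B} \) give the same output.

```python
import numpy as np

def discretize_bilinear(A, B, dt):
    """Bilinear (Tustin) discretization of x'(t) = A x(t) + B u(t)."""
    I = np.eye(A.shape[0])
    inv = np.linalg.inv(I - dt / 2 * A)
    A_bar = inv @ (I + dt / 2 * A)
    B_bar = (inv @ B) * dt
    return A_bar, B_bar

def run_recurrence(A_bar, B_bar, C, D, u):
    """Step x_k = A_bar x_{k-1} + B_bar u_k and emit y_k = C x_k + D u_k."""
    x = np.zeros(A_bar.shape[0])
    ys = []
    for u_k in u:
        x = A_bar @ x + B_bar * u_k
        ys.append(C @ x + D * u_k)
    return np.array(ys)

def run_convolution(A_bar, B_bar, C, D, u):
    """Same output computed as a causal convolution with K_l = C A_bar^l B_bar."""
    L = len(u)
    K = np.empty(L)
    v = B_bar.copy()
    for l in range(L):
        K[l] = C @ v      # v holds A_bar^l B_bar
        v = A_bar @ v
    return np.convolve(u, K)[:L] + D * u

# Illustrative random system (values are assumptions, chosen to be stable).
rng = np.random.default_rng(0)
n = 4
A = -np.eye(n) + 0.1 * rng.standard_normal((n, n))
B = rng.standard_normal(n)
C = rng.standard_normal(n)
D = 0.5
u = rng.standard_normal(32)
A_bar, B_bar = discretize_bilinear(A, B, dt=0.1)
y_rec = run_recurrence(A_bar, B_bar, C, D, u)
y_conv = run_convolution(A_bar, B_bar, C, D, u)
```

Building the kernel this way takes one matrix-vector product per time step, which illustrates why a more efficient kernel computation matters for long sequences.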
A step-search sequential quadratic programming method is proposed for solving nonlinear equality constrained stochastic optimization problems. It is assumed that constraint function values and derivatives are available, but only stochastic approximations of the objective function and its associated derivatives can be computed via inexact probabilistic zeroth- and first-order oracles. Under reasonable assumptions, a high-probability bound on the iteration complexity of the algorithm to approximate first-order stationarity is derived. Numerical results on standard nonlinear optimization test problems illustrate the advantages and limitations of our proposed method.
Large training data and expensive model tweaking are standard features of deep learning for images. As a result, data owners often utilize cloud resources to develop large-scale complex models, which raises privacy concerns. Existing solutions are either too expensive to be practical or do not sufficiently protect the confidentiality of data and models. In this paper, we study and compare novel \emph{image disguising} mechanisms, DisguisedNets and InstaHide, aiming to achieve a better trade-off among the level of protection for outsourced DNN model training, the expenses, and the utility of data. DisguisedNets are novel combinations of image blocktization, block-level random permutation, and two block-level secure transformations: random multidimensional projection (RMT) and AES pixel-level encryption (AES). InstaHide is an image mixup and random pixel flipping technique \cite{huang20}. We have analyzed and evaluated them under a multi-level threat model. RMT provides a better security guarantee than InstaHide, under the Level-1 adversarial knowledge with well-preserved model quality. In contrast, AES provides a security guarantee under the Level-2 adversarial knowledge, but it may affect model quality more. The unique features of image disguising also help us to protect models from model-targeted attacks. We have done an extensive experimental evaluation to understand how these methods work in different settings for different datasets.
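As a rough illustration of the first two ingredients named above (splitting an image into blocks and randomly permuting them), here is a minimal NumPy sketch. It is not the DisguisedNets implementation: it omits the RMT and AES block transformations entirely, works on a single-channel image, and the function name `disguise_blocks` and the use of a seed as the secret key are assumptions made for illustration.

```python
import numpy as np

def disguise_blocks(image, block, seed):
    """Split a 2-D image into block x block tiles and shuffle the tiles.

    Returns the shuffled image and the permutation used; here the seed
    plays the role of a secret key (an assumption for this sketch).
    """
    h, w = image.shape
    assert h % block == 0 and w % block == 0, "image must tile evenly"
    rows, cols = h // block, w // block
    # (h, w) -> (rows*cols, block, block), tiles in row-major order
    tiles = (image.reshape(rows, block, cols, block)
                  .swapaxes(1, 2)
                  .reshape(rows * cols, block, block))
    perm = np.random.default_rng(seed).permutation(rows * cols)
    shuffled = tiles[perm]
    # (rows*cols, block, block) -> (h, w)
    out = (shuffled.reshape(rows, cols, block, block)
                   .swapaxes(1, 2)
                   .reshape(h, w))
    return out, perm

img = np.arange(64).reshape(8, 8)
out, perm = disguise_blocks(img, block=4, seed=0)
```

Permutation alone only rearranges pixels, so every pixel value still appears in the output. The abstract's security claims rest on the additional block-level transformations, which this sketch does not attempt.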
A storyboard is a roadmap for video creation that consists of shot-by-shot images visualizing key plots in a text synopsis. Creating video storyboards remains challenging, however: it requires not only associating high-level text with images but also long-term reasoning to make transitions smooth across shots. In this paper, we propose a new task called Text synopsis to Video Storyboard (TeViS), which aims to retrieve an ordered sequence of images to visualize the text synopsis. We construct a MovieNet-TeViS benchmark based on the public MovieNet dataset. It contains 10K text synopses, each paired with keyframes that are manually selected from the corresponding movies by considering both relevance and cinematic coherence. We also present an encoder-decoder baseline for the task. The model uses a pretrained vision-and-language model to improve high-level text-image matching. To improve coherence in long-term shots, we further propose to pre-train the decoder on large-scale movie frames without text. Experimental results demonstrate that our proposed model significantly outperforms other models at creating text-relevant and coherent storyboards. Nevertheless, there is still a large gap compared to human performance, suggesting room for promising future work.
Solving real-world optimal control problems is challenging: the system dynamics can be highly nonlinear, the objectives and constraints can be nonconvex, and in some cases the dynamics are unknown, making it hard to solve numerically for the optimal control actions. To deal with these modeling and computation challenges, we integrate neural networks with Pontryagin's Minimum Principle (PMP) and propose a computationally efficient framework, NN-PMP. The resulting controller can be implemented for systems with unknown and complex dynamics. It not only uses accurate surrogate models parameterized by neural networks, but also efficiently recovers the optimality conditions along with the optimal action sequences via the PMP conditions. A toy example on nonlinear Martian base operation, along with a real-world lossy energy storage arbitrage example, demonstrates that NN-PMP is a general and versatile computation tool for finding optimal solutions. Compared with solutions provided by a numerical optimization solver with approximated linear dynamics, NN-PMP achieves more efficient system modeling and higher performance in terms of control objectives.