Language models have been shown to perform better with increased scale on a wide variety of tasks via the in-context learning paradigm. In this paper, we investigate the hypothesis that a large language model's ability to in-context learn to perform a task is not uniformly spread across all of its underlying components. Using a 66-billion-parameter language model (OPT-66B) across a diverse set of 14 downstream tasks, we find this is indeed the case: ~70% of attention heads and ~20% of feed-forward networks can be removed with minimal decline in task performance. We find substantial overlap in the set of attention heads (un)important for in-context learning across tasks and numbers of in-context examples. We also address our hypothesis through a task-agnostic lens, finding that a small set of attention heads in OPT-66B score highly on their ability to perform primitive induction operations associated with in-context learning, namely prefix matching and copying. These induction heads overlap with task-specific important heads, suggesting that induction heads are among the heads capable of more sophisticated behaviors associated with in-context learning. Overall, our study provides several insights indicating that large language models may be under-trained for in-context learning, and opens up questions on how to pre-train language models to perform it more effectively.
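The prefix-matching operation described above can be made concrete. Below is a minimal NumPy sketch of a prefix-matching score (our own simplified formulation for illustration, not the paper's exact metric): for each position, measure how much attention a head pays to the token that followed an earlier occurrence of the current token, which is exactly where an induction head should look.

```python
import numpy as np

def prefix_matching_score(attn, tokens):
    """Score a head's prefix-matching behaviour: average attention paid
    from each position to the token that *followed* an earlier occurrence
    of the current token (the 'induction' target).

    attn: (seq, seq) attention weights for one head; tokens: token ids.
    """
    scores = []
    for t in range(1, len(tokens)):
        # induction targets: positions j+1 where tokens[j] == tokens[t], j+1 < t
        targets = [j + 1 for j in range(t - 1) if tokens[j] == tokens[t]]
        if targets:
            scores.append(attn[t, targets].sum())
    return float(np.mean(scores)) if scores else 0.0
```

On a repeated random sequence, a perfect induction head (all attention on the induction target) scores 1.0, while a head attending elsewhere scores near 0; ranking heads by this score is how candidate induction heads are typically surfaced.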
End-to-end speech recognition models trained with a joint Connectionist Temporal Classification (CTC)-attention loss have gained popularity recently. In these models, a non-autoregressive CTC decoder is often used at inference time due to its speed and simplicity. However, such models are hard to personalize because of their conditional independence assumption, which prevents output tokens from previous time steps from influencing future predictions. To tackle this, we propose a novel two-way approach that first biases the encoder with attention over a predefined list of rare long-tail and out-of-vocabulary (OOV) words, and then uses dynamic boosting and a phone alignment network during decoding to further bias the subword predictions. We evaluate our approach on the open-source VoxPopuli dataset and on in-house medical datasets, showing a 60% improvement in F1 score on domain-specific rare words over a strong CTC baseline.
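The encoder-biasing step lends itself to a toy sketch. The following is our simplification, not the paper's exact architecture (the additive combination and all names are assumptions): attend from an encoder frame over embeddings of the rare-word list and fold the attended context back into the frame, so frames acoustically close to a listed word get pulled toward it.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def bias_encoder_frame(frame, bias_embeds, scale=1.0):
    """Toy contextual-biasing sketch (our illustration): one encoder frame
    attends over embeddings of the predefined rare-word list, and the
    attended context is added back to the frame."""
    weights = softmax(scale * bias_embeds @ frame)  # (n_words,)
    context = weights @ bias_embeds                 # (d_model,)
    return frame + context
```

With a frame that strongly matches one bias embedding, nearly all attention mass lands on that word and the frame is shifted in its direction.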
Automatic speech recognition (ASR) systems have found use in numerous industrial applications across very diverse domains. Since domain-specific systems perform better than their domain-agnostic counterparts on in-domain evaluation, the need for memory- and compute-efficient domain adaptation is evident. In particular, adapting the parameter-heavy Transformer-based language models used for rescoring ASR hypotheses is challenging. In this work, we introduce domain prompts, a method that trains a small number of domain token embedding parameters to prime a Transformer-based LM for a particular domain. With just a handful of extra parameters per domain, we achieve a 7-14% improvement over the baseline of using an unadapted LM. Despite being parameter-efficient, these improvements are comparable to those of fully fine-tuned models with hundreds of millions of parameters. With ablations on prompt sizes, dataset sizes, initializations and domains, we provide evidence for the benefits of using domain prompts in ASR systems.
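The mechanism admits a short sketch. Assuming a prompt-tuning-style setup (the class and parameter names below are ours, not the paper's API), a domain prompt amounts to a small trainable embedding matrix prepended to the frozen LM's input embeddings; only this matrix is updated per domain.

```python
import numpy as np

class DomainPrompt:
    """Trainable domain token embeddings prepended to the LM's input
    embeddings; the LM's own weights stay frozen (illustrative sketch)."""

    def __init__(self, n_prompt_tokens, d_model, seed=0):
        rng = np.random.default_rng(seed)
        # the only per-domain trainable parameters
        self.prompt = rng.normal(scale=0.02, size=(n_prompt_tokens, d_model))

    def __call__(self, input_embeddings):
        # (seq, d_model) -> (n_prompt_tokens + seq, d_model)
        return np.concatenate([self.prompt, input_embeddings], axis=0)
```

For a 10-token prompt and a 16-dimensional toy model this is only 160 trainable values per domain, which is the source of the parameter efficiency the abstract highlights.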
Advances in reinforcement learning have led to its successful application to complex tasks with continuous state and action spaces. Despite these advances in practice, most theoretical work pertains to finite state and action spaces. We propose building a theoretical understanding of continuous state and action spaces through a geometric lens. Central to our work is the idea that the transition dynamics induce a low-dimensional manifold of reachable states embedded in the high-dimensional nominal state space. We prove that, under certain conditions, the dimensionality of this manifold is at most the dimensionality of the action space plus one. This is the first result of its kind, linking the geometry of the state space to the dimensionality of the action space. We empirically corroborate this upper bound for four MuJoCo environments. We further demonstrate the applicability of our result by learning a policy in this low-dimensional representation. To do so, we introduce an algorithm that learns a mapping to a low-dimensional representation, as a narrow hidden layer of a deep neural network, in tandem with the policy using DDPG. Our experiments show that policies learnt this way perform on par or better on four MuJoCo control suite tasks.
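The last step above, acting through a narrow hidden layer, can be sketched simply. Layer names and sizes here are our illustration (not the paper's exact architecture); the bound suggests a bottleneck of dimension roughly the action dimension plus one.

```python
import numpy as np

def bottleneck_policy(obs, W_enc, policy_head):
    """Illustrative sketch: map the nominal state through a narrow hidden
    layer (the learned low-dimensional representation), then compute the
    action from that representation only."""
    z = np.tanh(W_enc @ obs)   # narrow layer: latent dim << obs dim
    return policy_head(z)
```

In the paper's setup both the encoder weights and the policy head would be trained jointly with DDPG; here the point is only that the downstream policy never sees more than the bottleneck's few coordinates.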
Deep neural networks can approximate functions on different types of data, from images to graphs, with varied underlying structure. This underlying structure can be viewed as the geometry of the data manifold. By extending recent advances in the theoretical understanding of neural networks, we study how a randomly initialized neural network with piecewise-linear activations splits the data manifold into regions where the network behaves as a linear function. We derive bounds on the density of linear-region boundaries and on the distance to these boundaries on the data manifold. This leads to insights into the expressivity of randomly initialized deep neural networks on non-Euclidean data sets. We empirically corroborate our theoretical results on a toy supervised learning problem. Our experiments demonstrate that the number of linear regions varies across manifolds and that the results hold across different neural network architectures. Using the MetFaces dataset, we further demonstrate how the complexity of linear regions differs on the low-dimensional manifold of images compared to Euclidean space.
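The central object of study can be made concrete with a small sketch (our own illustration): for a ReLU network, two inputs lie in the same linear region exactly when every unit has the same on/off sign for both, so counting distinct consecutive activation patterns along a sampled curve estimates how many linear regions the curve crosses.

```python
import numpy as np

def activation_pattern(weights, biases, x):
    """Sign pattern of every ReLU unit; inputs in the same linear region
    of a piecewise-linear network share this pattern."""
    pattern = []
    h = x
    for W, b in zip(weights, biases):
        pre = W @ h + b
        pattern.append(tuple(pre > 0))
        h = np.maximum(pre, 0)
    return tuple(pattern)

def count_regions_on_curve(weights, biases, points):
    """Estimate how many linear regions a curve (sampled at `points`)
    crosses by counting changes in the activation pattern."""
    patterns = [activation_pattern(weights, biases, p) for p in points]
    return 1 + sum(p1 != p2 for p1, p2 in zip(patterns, patterns[1:]))
```

Sampling points along a path on the data manifold rather than a straight line in the ambient space is what lets this kind of count compare region density on and off the manifold.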
The availability of well-curated datasets has driven the success of machine learning (ML) models. Despite increased access to earth observation data for agriculture, there are few curated, labelled datasets, which limits their potential for training ML models for remote sensing in agriculture. To this end, we introduce a first-of-its-kind dataset, SICKLE, containing time-series images at different spatial resolutions from 3 different satellites, annotated with multiple key cropping parameters for paddy cultivation in the Cauvery Delta region of Tamil Nadu, India. The dataset comprises 2,398 season-wise samples from 388 unique plots, distributed across 4 districts of the delta. It covers multi-spectral, thermal and microwave data for the period between January 2018 and March 2021. The paddy samples are annotated with 4 key cropping parameters: sowing date, transplanting date, harvesting date and crop yield. This is one of the first studies to consider the growing season (using sowing and harvesting dates) as part of a dataset. We also propose a yield-prediction strategy that uses time-series data generated from the observed growing season as well as standard seasonal information obtained for the region from Tamil Nadu Agricultural University. The resulting performance improvement highlights the impact of ML techniques that leverage domain knowledge aligned with the standard practices followed by farmers in a specific region. We benchmark the dataset on 3 separate tasks, namely crop type, phenology date (sowing, transplanting, harvesting) and yield prediction, and develop an end-to-end framework for predicting key crop parameters in real-world settings.
Label hierarchies are often available as part of biological taxonomies or linguistic datasets. Several works exploit these to learn hierarchy-aware features in order to improve classifiers so that they make semantically meaningful mistakes while maintaining or reducing overall error. In this paper, we propose a novel approach for learning Hierarchy Aware Features (HAF) that leverages classifiers at each level of the hierarchy, constrained to generate predictions consistent with the label hierarchy. The classifiers are trained by minimizing a Jensen-Shannon divergence with target soft labels obtained from the fine-grained classifier. In addition, we employ a simple geometric loss that constrains the feature-space geometry to capture the semantic structure of the label space. HAF is a training-time approach that improves the mistakes made while maintaining top-1 error, thereby addressing the problem with cross-entropy loss, which treats all mistakes as equal. We evaluate HAF on three hierarchical datasets and achieve state-of-the-art results on the iNaturalist-19 and CIFAR-100 datasets. The source code is available at https://github.com/07agarg/haf
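The JSD objective admits a short sketch. We assume here that the soft targets are obtained by marginalizing the fine-grained classifier's probabilities up the hierarchy (an assumption on our part; function names are ours, not the paper's):

```python
import numpy as np

def jensen_shannon(p, q, eps=1e-12):
    """Jensen-Shannon divergence between two discrete distributions."""
    p, q = np.asarray(p, float) + eps, np.asarray(q, float) + eps
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(a * np.log(a / b))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def haf_level_loss(fine_probs, coarse_probs, fine_to_coarse):
    """Sketch of a per-level HAF-style loss: marginalize fine-grained
    probabilities to coarse soft targets, then penalize the coarse
    classifier's JSD from those targets."""
    n_coarse = max(fine_to_coarse) + 1
    target = np.zeros(n_coarse)
    for f, c in enumerate(fine_to_coarse):
        target[c] += fine_probs[f]
    return jensen_shannon(coarse_probs, target)
```

The loss is zero exactly when the coarse-level classifier agrees with the hierarchy-consistent marginal of the fine-grained one, which is the consistency constraint described above.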
We formulate a novel inference task in the domain of multivariate time series forecasting (MTSF), called Variable Subset Forecast (VSF), where only a small subset of the variables is available during inference. Variables are absent during inference because of long-term data loss (e.g., sensor failures) or a high -> low resource domain shift between train and test. To the best of our knowledge, the robustness of MTSF models in the presence of such failures has not been studied in the literature. Through extensive evaluation, we first show that the performance of state-of-the-art methods degrades significantly in the VSF setting. We propose a non-parametric, wrapper technique that can be applied on top of any existing forecasting model. Through systematic experiments across 4 datasets and 5 forecasting models, we show that our technique is able to recover close to 95% of the model's performance even when only 15% of the original variables are present.
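The abstract does not spell out the wrapper, so the following is purely an illustrative sketch of one non-parametric possibility consistent with the description (not necessarily the paper's method): retrieve the training window closest to the observed variable subset, use it to fill in the missing variables, then apply any off-the-shelf forecaster on the completed window.

```python
import numpy as np

def impute_then_forecast(window, observed_idx, train_windows, forecaster):
    """Illustrative non-parametric wrapper for the VSF setting: nearest-
    neighbor retrieval over the observed columns fills in the missing
    variables; the wrapped forecaster never needs retraining."""
    dists = [np.linalg.norm(w[:, observed_idx] - window[:, observed_idx])
             for w in train_windows]
    nearest = train_windows[int(np.argmin(dists))]
    full = nearest.copy()
    full[:, observed_idx] = window[:, observed_idx]  # keep observed values
    return forecaster(full)
```

Because the wrapper only touches the model's input, it can sit in front of any of the 5 forecasting models the abstract mentions without modifying them.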
Ensemble models (bagging and gradient boosting) of relational decision trees have proven to be among the most effective learning methods in the area of probabilistic logic models (PLMs). While effective, they lose one of the most important aspects of PLMs: interpretability. In this paper, we consider the problem of compressing a large set of learned trees into a single explainable model. To this end, we propose COTE (Compression of Trees), which produces a single small decision list as the compressed representation. COTE first converts the trees to decision lists and then performs the combination and compression with the aid of the original training set. Experimental evaluation demonstrates the effectiveness of COTE on several benchmark relational datasets.
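The first step, converting a tree to a decision list, can be sketched directly (the tree encoding below is our own illustrative one, not COTE's internal representation): each root-to-leaf path becomes one ordered rule whose condition is the conjunction of the tests along the path.

```python
def tree_to_decision_list(tree, path=()):
    """Flatten a binary decision tree into an ordered decision list of
    (conjunction-of-tests, label) rules, one rule per root-to-leaf path."""
    if not isinstance(tree, dict):          # leaf: predicted label
        return [(path, tree)]
    test = tree["test"]
    rules = tree_to_decision_list(tree["yes"], path + ((test, True),))
    rules += tree_to_decision_list(tree["no"], path + ((test, False),))
    return rules
```

Once every tree in the ensemble is in this flat form, combining and pruning rules against the original training set (the remaining steps described above) operates on a single homogeneous representation.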
We propose a novel parameterized skill-learning algorithm that aims to learn transferable parameterized skills and compose them into a new action space to support efficient learning in long-horizon tasks. We first propose novel learning objectives, trajectory-centric diversity and smoothness, that allow an agent to learn reusable parameterized skills. The agent can then use these learned skills to construct a temporally-extended, parameterized-action Markov decision process, for which we propose a hierarchical actor-critic algorithm designed to efficiently learn a high-level control policy on top of the learned skills. We empirically demonstrate that the proposed algorithm enables an agent to solve complex long-horizon obstacle-course environments.