及时调整尝试更新预训练模型中的一些特定任务参数。它的性能与在语言理解和发电任务上的完整参数设置的微调相当。在这项工作中,我们研究了迅速调整神经文本检索器的问题。我们引入参数效率的及时调整,以调整跨内域,跨域和跨主题设置的文本检索。通过广泛的分析,我们表明该策略可以通过基于微调的检索方法来减轻两个问题 - 参数 - 信息和弱推广性。值得注意的是,它可以显着改善检索模型的零零弹性概括。通过仅更新模型参数的0.1%,及时调整策略可以帮助检索模型获得比所有参数更新的传统方法更好的概括性能。最后,为了促进回猎犬的跨主题概括性的研究,我们策划并发布了一个学术检索数据集,其中包含18K查询的87个主题,使其成为迄今为止特定于特定于主题的主题。
translated by 谷歌翻译
语言模型既展示了定量的改进,又展示了新的定性功能,随着规模的增加。尽管它们具有潜在的变革性影响,但这些新能力的特征却很差。为了为未来的研究提供信息,为破坏性的新模型能力做准备,并改善社会有害的效果,至关重要的是,我们必须了解目前和近乎未来的能力和语言模型的局限性。为了应对这一挑战,我们介绍了超越模仿游戏基准(Big Bench)。 Big Bench目前由204个任务组成,由132家机构的442位作者贡献。任务主题是多样的,从语言学,儿童发展,数学,常识性推理,生物学,物理学,社会偏见,软件开发等等。 Big-Bench专注于被认为超出当前语言模型的功能的任务。我们评估了OpenAI的GPT型号,Google内部密集变压器体系结构和大型基础上的开关稀疏变压器的行为,跨越了数百万到数十亿个参数。此外,一个人类专家评估者团队执行了所有任务,以提供强大的基准。研究结果包括:模型性能和校准都随规模改善,但绝对的术语(以及与评估者的性能相比);在模型类中的性能非常相似,尽管带有稀疏性。逐渐和预测的任务通常涉及大量知识或记忆成分,而在临界规模上表现出“突破性”行为的任务通常涉及多个步骤或组成部分或脆性指标;社交偏见通常会随着含糊不清的环境而随着规模而增加,但这可以通过提示来改善。
translated by 谷歌翻译
最近的研究表明,通过梯度下降训练的无限宽神经网络(NN)的动态可以是神经切线核(NTK)\ CITEP {Jacot2018neural}的特征。在平方损失下,通过梯度下降训练的无限宽度NN,具有无限小的学习速率等同于与NTK \ CITEP {arora2019Exact}的内核回归。但是,当前ridge回归{arora2019Harnessing}只知道等价物,而NN和其他内核机(KMS)之间的等价,例如,支持向量机(SVM),仍然未知。因此,在这项工作中,我们建议在NN和SVM之间建立等效,具体而言,通过柔软的边缘损失和具有由子润发性培训的NTK培训的标准柔软裕度SVM培训的无限宽NN。我们的主要理论结果包括建立NN和广泛的$ \ ELL_2 $正规化KMS之间的等价,其中有限宽度界限,不能通过事先工作来处理,并显示出通过这种正规化损耗函数训练的每个有限宽度NN大约一公里。此外,我们展示了我们的理论可以实现三种实际应用,包括(i)\ yressit {非空心}通过相应的km界限Nn; (ii)无限宽度NN的\ yryit {非琐碎}鲁棒性证书(而现有的鲁棒性验证方法提供空中界定); (iii)本质上更强大的无限宽度NN,来自以前的内核回归。我们的实验代码可用于\ URL {https://github.com/leslie-ch/equiv-nn-svm}。
translated by 谷歌翻译
Fine-tuning pre-trained models has been ubiquitously proven to be effective in a wide range of NLP tasks. However, fine-tuning the whole model is parameter inefficient as it always yields an entirely new model for each task. Currently, many research works propose to only fine-tune a small portion of the parameters while keeping most of the parameters shared across different tasks. These methods achieve surprisingly good performance and are shown to be more stable than their corresponding fully fine-tuned counterparts. However, such kind of methods is still not well understood. Some natural questions arise: How does the parameter sparsity lead to promising performance? Why is the model more stable than the fully fine-tuned models? How to choose the tunable parameters? In this paper, we first categorize the existing methods into random approaches, rule-based approaches, and projection-based approaches based on how they choose which parameters to tune. Then, we show that all of the methods are actually sparse fine-tuned models and conduct a novel theoretical analysis of them. We indicate that the sparsity is actually imposing a regularization on the original model by controlling the upper bound of the stability. Such stability leads to better generalization capability which has been empirically observed in a lot of recent research works. Despite the effectiveness of sparsity grounded by our theory, it still remains an open problem of how to choose the tunable parameters. To better choose the tunable parameters, we propose a novel Second-order Approximation Method (SAM) which approximates the original problem with an analytically solvable optimization function. The tunable parameters are determined by directly optimizing the approximation function. The experimental results show that our proposed SAM model outperforms many strong baseline models and it also verifies our theoretical analysis.
translated by 谷歌翻译
Prompt learning recently become an effective linguistic tool to motivate the PLMs' knowledge on few-shot-setting tasks. However, studies have shown the lack of robustness still exists in prompt learning, since suitable initialization of continuous prompt and expert-first manual prompt are essential in fine-tuning process. What is more, human also utilize their comparative ability to motivate their existing knowledge for distinguishing different examples. Motivated by this, we explore how to use contrastive samples to strengthen prompt learning. In detail, we first propose our model ConsPrompt combining with prompt encoding network, contrastive sampling module, and contrastive scoring module. Subsequently, two sampling strategies, similarity-based and label-based strategies, are introduced to realize differential contrastive learning. The effectiveness of proposed ConsPrompt is demonstrated in five different few-shot learning tasks and shown the similarity-based sampling strategy is more effective than label-based in combining contrastive learning. Our results also exhibits the state-of-the-art performance and robustness in different few-shot settings, which proves that the ConsPrompt could be assumed as a better knowledge probe to motivate PLMs.
translated by 谷歌翻译
Performance of spoken language understanding (SLU) can be degraded with automatic speech recognition (ASR) errors. We propose a novel approach to improve SLU robustness by randomly corrupting clean training text with an ASR error simulator, followed by self-correcting the errors and minimizing the target classification loss in a joint manner. In the proposed error simulator, we leverage confusion networks generated from an ASR decoder without human transcriptions to generate a variety of error patterns for model training. We evaluate our approach on the DSTC10 challenge targeted for knowledge-grounded task-oriented conversational dialogues with ASR errors. Experimental results show the effectiveness of our proposed approach, boosting the knowledge-seeking turn detection (KTD) F1 significantly from 0.9433 to 0.9904. Knowledge cluster classification is boosted from 0.7924 to 0.9333 in Recall@1. After knowledge document re-ranking, our approach shows significant improvement in all knowledge selection metrics, from 0.7358 to 0.7806 in Recall@1, from 0.8301 to 0.9333 in Recall@5, and from 0.7798 to 0.8460 in MRR@5 on the test set. In the recent DSTC10 evaluation, our approach demonstrates significant improvement in knowledge selection, boosting Recall@1 from 0.495 to 0.7144 compared to the official baseline. Our source code is released in GitHub https://github.com/yctam/dstc10_track2_task2.git.
translated by 谷歌翻译
Causal language modeling (LM) uses word history to predict the next word. BERT, on the other hand, makes use of bi-directional word information in a sentence to predict words at masked positions. While BERT is effective in sequence encoding, it is non-causal by nature and is not designed for sequence generation. In this paper, we propose a novel language model, SUffix REtrieval-Augmented LM (SUREALM), that simulates a bi-directional contextual effect in an autoregressive manner. SUREALM employs an embedding retriever to search for training sentences in a data store that share similar word history during sequence generation. In particular, the suffix portions of the retrieved sentences mimick the "future" context. We evaluated our proposed model on the DSTC9 spoken dialogue corpus and showed promising word perplexity reduction on the validation and test set compared to competitive baselines.
translated by 谷歌翻译
In this paper, we present a robust and low complexity deep learning model for Remote Sensing Image Classification (RSIC), the task of identifying the scene of a remote sensing image. In particular, we firstly evaluate different low complexity and benchmark deep neural networks: MobileNetV1, MobileNetV2, NASNetMobile, and EfficientNetB0, which present the number of trainable parameters lower than 5 Million (M). After indicating best network architecture, we further improve the network performance by applying attention schemes to multiple feature maps extracted from middle layers of the network. To deal with the issue of increasing the model footprint as using attention schemes, we apply the quantization technique to satisfies the number trainable parameter of the model lower than 5 M. By conducting extensive experiments on the benchmark datasets NWPU-RESISC45, we achieve a robust and low-complexity model, which is very competitive to the state-of-the-art systems and potential for real-life applications on edge devices.
translated by 谷歌翻译
We formulate grasp learning as a neural field and present Neural Grasp Distance Fields (NGDF). Here, the input is a 6D pose of a robot end effector and output is a distance to a continuous manifold of valid grasps for an object. In contrast to current approaches that predict a set of discrete candidate grasps, the distance-based NGDF representation is easily interpreted as a cost, and minimizing this cost produces a successful grasp pose. This grasp distance cost can be incorporated directly into a trajectory optimizer for joint optimization with other costs such as trajectory smoothness and collision avoidance. During optimization, as the various costs are balanced and minimized, the grasp target is allowed to smoothly vary, as the learned grasp field is continuous. In simulation benchmarks with a Franka arm, we find that joint grasping and planning with NGDF outperforms baselines by 63% execution success while generalizing to unseen query poses and unseen object shapes. Project page: https://sites.google.com/view/neural-grasp-distance-fields.
translated by 谷歌翻译
随着编码器架构的开发,研究人员能够使用更广泛的数据来研究文本生成任务。其中,KB到文本旨在将一组知识三元转换为人类可读句子。在原始设置中,任务假定输入三元和文本是从具体知识/信息的角度进行对齐的。在本文中,我们扩展了此设置,并探讨了如何促进训练的模型以生成更有信息的文本,即包含有关三重实体但未通过输入三元组传达的更多信息。为了解决这个问题,我们提出了一种新型的内存增强发电机,该发电机采用存储网络来记住培训期间学到的有用知识,并利用此类信息以及输入三元组在操作或测试阶段生成文本。我们从WebNLG中得出一个数据集,以进行新的环境,并进行广泛的实验,以研究我们的模型的有效性以及发现设置的内在特征。
translated by 谷歌翻译