Large pretrained language models have shown surprising In-Context Learning (ICL) ability. With a few demonstration input-label pairs, they can predict the label for an unseen input without additional parameter updates. Despite the great success in performance, the working mechanism of ICL still remains an open problem. In order to better understand how ICL works, this paper explains language models as meta-optimizers and understands ICL as a kind of implicit finetuning. Theoretically, we figure out that the Transformer attention has a dual form of gradient descent based optimization. On top of it, we understand ICL as follows: GPT first produces meta-gradients according to the demonstration examples, and then these meta-gradients are applied to the original GPT to build an ICL model. Experimentally, we comprehensively compare the behavior of ICL and explicit finetuning based on real tasks to provide empirical evidence that supports our understanding. The results prove that ICL behaves similarly to explicit finetuning at the prediction level, the representation level, and the attention behavior level. Further, inspired by our understanding of meta-optimization, we design a momentum-based attention by analogy with the momentum-based gradient descent algorithm. Its consistently better performance over vanilla attention supports our understanding again from another aspect, and more importantly, it shows the potential to utilize our understanding for future model designing.
translated by 谷歌翻译
Language models are widely deployed to provide automatic text completion services in user products. However, recent research has revealed that language models (especially large ones) bear considerable risk of memorizing private training data, which is then vulnerable to leakage and extraction by adversaries. In this study, we test the efficacy of a range of privacy-preserving techniques to mitigate unintended memorization of sensitive user text, while varying other factors such as model size and adversarial conditions. We test both "heuristic" mitigations (those without formal privacy guarantees) and Differentially Private training, which provides provable levels of privacy at the cost of some model performance. Our experiments show that (with the exception of L2 regularization), heuristic mitigations are largely ineffective in preventing memorization in our test suite, possibly because they make too strong of assumptions about the characteristics that define "sensitive" or "private" text. In contrast, Differential Privacy reliably prevents memorization in our experiments, despite its computational and model-performance costs.
translated by 谷歌翻译
Despite the fast advances in high-sigma yield analysis with the help of machine learning techniques in the past decade, one of the main challenges, the curse of dimensionality, which is inevitable when dealing with modern large-scale circuits, remains unsolved. To resolve this challenge, we propose an absolute shrinkage deep kernel learning, ASDK, which automatically identifies the dominant process variation parameters in a nonlinear-correlated deep kernel and acts as a surrogate model to emulate the expensive SPICE simulation. To further improve the yield estimation efficiency, we propose a novel maximization of approximated entropy reduction for an efficient model update, which is also enhanced with parallel batch sampling for parallel computing, making it ready for practical deployment. Experiments on SRAM column circuits demonstrate the superiority of ASDK over the state-of-the-art (SOTA) approaches in terms of accuracy and efficiency with up to 10.3x speedup over SOTA methods.
translated by 谷歌翻译
Real-time semantic segmentation has played an important role in intelligent vehicle scenarios. Recently, numerous networks have incorporated information from multi-size receptive fields to facilitate feature extraction in real-time semantic segmentation tasks. However, these methods preferentially adopt massive receptive fields to elicit more contextual information, which may result in inefficient feature extraction. We believe that the elaborated receptive fields are crucial, considering the demand for efficient feature extraction in real-time tasks. Therefore, we propose an effective and efficient architecture termed Dilation-wise Residual segmentation (DWRSeg), which possesses different sets of receptive field sizes within different stages. The architecture involves (i) a Dilation-wise Residual (DWR) module for extracting features based on different scales of receptive fields in the high level of the network; (ii) a Simple Inverted Residual (SIR) module that uses an inverted bottleneck structure to extract features from the low stage; and (iii) a simple fully convolutional network (FCN)-like decoder for aggregating multiscale feature maps to generate the prediction. Extensive experiments on the Cityscapes and CamVid datasets demonstrate the effectiveness of our method by achieving a state-of-the-art trade-off between accuracy and inference speed, in addition to being lighter weight. Without using pretraining or resorting to any training trick, we achieve 72.7% mIoU on the Cityscapes test set at a speed of 319.5 FPS on one NVIDIA GeForce GTX 1080 Ti card, which is significantly faster than existing methods. The code and trained models are publicly available.
translated by 谷歌翻译
The task of response selection in multi-turn dialogue is to find the best option from all candidates. In order to improve the reasoning ability of the model, previous studies pay more attention to using explicit algorithms to model the dependencies between utterances, which are deterministic, limited and inflexible. In addition, few studies consider differences between the options before and after reasoning. In this paper, we propose an Implicit Relational Reasoning Graph Network to address these issues, which consists of the Utterance Relational Reasoner (URR) and the Option Dual Comparator (ODC). URR aims to implicitly extract dependencies between utterances, as well as utterances and options, and make reasoning with relational graph convolutional networks. ODC focuses on perceiving the difference between the options through dual comparison, which can eliminate the interference of the noise options. Experimental results on two multi-turn dialogue reasoning benchmark datasets MuTual and MuTual+ show that our method significantly improves the baseline of four pretrained language models and achieves state-of-the-art performance. The model surpasses human performance for the first time on the MuTual dataset.
translated by 谷歌翻译
Finetuning language models on a collection of datasets phrased as instructions has been shown to improve model performance and generalization to unseen tasks. In this paper we explore instruction finetuning with a particular focus on (1) scaling the number of tasks, (2) scaling the model size, and (3) finetuning on chain-of-thought data. We find that instruction finetuning with the above aspects dramatically improves performance on a variety of model classes (PaLM, T5, U-PaLM), prompting setups (zero-shot, few-shot, CoT), and evaluation benchmarks (MMLU, BBH, TyDiQA, MGSM, open-ended generation). For instance, Flan-PaLM 540B instruction-finetuned on 1.8K tasks outperforms PALM 540B by a large margin (+9.4% on average). Flan-PaLM 540B achieves state-of-the-art performance on several benchmarks, such as 75.2% on five-shot MMLU. We also publicly release Flan-T5 checkpoints, which achieve strong few-shot performance even compared to much larger models, such as PaLM 62B. Overall, instruction finetuning is a general method for improving the performance and usability of pretrained language models.
translated by 谷歌翻译
在神经影像分析中,功能磁共振成像(fMRI)可以很好地评估没有明显结构病变的脑疾病的大脑功能变化。到目前为止,大多数基于研究的FMRI研究将功能连接性作为疾病分类的基本特征。但是,功能连接通常是根据感兴趣的预定义区域的时间序列计算的,并忽略了每个体素中包含的详细信息,这可能会导致诊断模型的性能恶化。另一个方法论上的缺点是训练深模型的样本量有限。在这项研究中,我们提出了Brainformer,这是一种用于单个FMRI体积的脑疾病分类的一般混合变压器架构,以充分利用素食细节,并具有足够的数据尺寸和尺寸。脑形形式是通过对每个体素内的局部提示进行建模的3D卷积,并捕获两个全球注意力障碍的遥远地区之间的全球关系。局部和全局线索通过单流模型在脑形中汇总。为了处理多站点数据,我们提出了一个归一化层,以将数据标准化为相同的分布。最后,利用一种基于梯度的定位图可视化方法来定位可能的疾病相关生物标志物。我们在五个独立获取的数据集上评估了脑形形成器,包括Abide,ADNI,MPILMBB,ADHD-200和ECHO,以及自闭症疾病,阿尔茨海默氏病,抑郁症,注意力缺陷多动障碍和头痛疾病。结果证明了脑形对多种脑疾病的诊断的有效性和普遍性。脑形物可以在临床实践中促进基于神经成像的精确诊断,并激励FMRI分析中的未来研究。代码可在以下网址获得:https://github.com/ziyaozhangforpcl/brainformer。
translated by 谷歌翻译
在这份技术报告中,我们介绍了数字写作助手(高效且智能编辑),该助手通过使用人工智能(AI)技术来促进用户更有效地编写更高质量的文本。以前的写作助理通常提供错误检查的功能(以检测和纠正拼写和语法错误)和有限的文本练习功能。随着大型神经语言模型的出现,一些系统支持自动完成句子或段落。在Effidit中,我们通过提供五个类别的功能来显着扩展写作助手的能力:文本完成,错误检查,文本抛光,关键字到句子(K2S)和云输入方法(Cloud IME)。在文本完成类别中,Effidit支持基于生成的句子完成,基于检索的句子完成和短语完成。相比之下,到目前为止,许多其他写作助理仅提供三个功能中的一两个。对于文本抛光,我们具有三个函数:(上下文感知)短语抛光,句子释义和句子扩展,而其他许多写作助手通常会在此类别中支持一两个功能。本报告的主要内容包括象征的主要模块,实施这些模块的方法以及一些关键方法的评估结果。
translated by 谷歌翻译
动态面部表达识别(FER)数据库为情感计算和应用提供了重要的数据支持。但是,大多数FER数据库都用几个基本的相互排斥性类别注释,并且仅包含一种模式,例如视频。单调的标签和模式无法准确模仿人类的情绪并实现现实世界中的应用。在本文中,我们提出了MAFW,这是一个大型多模式复合情感数据库,野外有10,045个视频Audio剪辑。每个剪辑都有一个复合的情感类别和几个句子,这些句子描述了剪辑中受试者的情感行为。对于复合情绪注释,每个剪辑都被归类为11种广泛使用的情绪中的一个或多个,即愤怒,厌恶,恐惧,幸福,中立,悲伤,惊喜,蔑视,焦虑,焦虑,无助和失望。为了确保标签的高质量,我们通过预期最大化(EM)算法来滤除不可靠的注释,然后获得11个单标签情绪类别和32个多标签情绪类别。据我们所知,MAFW是第一个带有复合情感注释和与情感相关的字幕的野外多模式数据库。此外,我们还提出了一种新型的基于变压器的表达片段特征学习方法,以识别利用不同情绪和方式之间表达变化关系的复合情绪。在MAFW数据库上进行的广泛实验显示了所提出方法的优势,而不是其他最先进的方法对单型和多模式FER的优势。我们的MAFW数据库可从https://mafw-database.github.io/mafw公开获得。
translated by 谷歌翻译
对比视力语言预训练(称为剪辑)为使用大型图像文本对学习视觉表示提供了新的范式。通过零拍知识转移,它在下游任务上表现出令人印象深刻的表现。为了进一步增强剪辑的适应能力,现有的方法提议微调额外的可学习模块,这大大改善了少量的性能,但引入了额外的培训时间和计算资源。在本文中,我们提出了一种无训练的适应方法,用于进行剪辑进行几个弹药分类,称为Tip-Adapter,该分类不仅继承了零拍剪辑的无训练优势,而且还与训练需要的那些相当的表现相当方法。 TIP-ADAPTER通过少数照片训练集通过键值缓存模型构造适配器,并更新通过功能检索中剪辑中编码的先验知识。最重要的是,可以通过对10 $ \ times $ \现有方法少的速度$ \ times $ $ \现有方法进行微调,这可以进一步提高Imagenet上的最先进。高效的。我们在11个数据集上进行了很少的射击分类实验,以证明我们提出的方法的优势。代码在https://github.com/gaopengcuhk/tip-adapter上发布。
translated by 谷歌翻译