As language models (LMs) scale, they develop many novel behaviors, good and bad, exacerbating the need to evaluate how they behave. Prior work creates evaluations with crowdwork (which is time-consuming and expensive) or existing data sources (which are not always available). Here, we automatically generate evaluations with LMs. We explore approaches with varying amounts of human effort, from instructing LMs to write yes/no questions to making complex Winogender schemas with multiple stages of LM-based generation and filtering. Crowdworkers rate the examples as highly relevant and agree with 90-100% of labels, sometimes more so than corresponding human-written datasets. We generate 154 datasets and discover new cases of inverse scaling where LMs get worse with size. Larger LMs repeat back a dialog user's preferred answer ("sycophancy") and express greater desire to pursue concerning goals like resource acquisition and goal preservation. We also find some of the first examples of inverse scaling in RL from Human Feedback (RLHF), where more RLHF makes LMs worse. For example, RLHF makes LMs express stronger political views (on gun rights and immigration) and a greater desire to avoid shut down. Overall, LM-written evaluations are high-quality and let us quickly discover many novel LM behaviors.
translated by 谷歌翻译
The number of international benchmarking competitions is steadily increasing in various fields of machine learning (ML) research and practice. So far, however, little is known about the common practice as well as bottlenecks faced by the community in tackling the research questions posed. To shed light on the status quo of algorithm development in the specific field of biomedical imaging analysis, we designed an international survey that was issued to all participants of challenges conducted in conjunction with the IEEE ISBI 2021 and MICCAI 2021 conferences (80 competitions in total). The survey covered participants' expertise and working environments, their chosen strategies, as well as algorithm characteristics. A median of 72% challenge participants took part in the survey. According to our results, knowledge exchange was the primary incentive (70%) for participation, while the reception of prize money played only a minor role (16%). While a median of 80 working hours was spent on method development, a large portion of participants stated that they did not have enough time for method development (32%). 25% perceived the infrastructure to be a bottleneck. Overall, 94% of all solutions were deep learning-based. Of these, 84% were based on standard architectures. 43% of the respondents reported that the data samples (e.g., images) were too large to be processed at once. This was most commonly addressed by patch-based training (69%), downsampling (37%), and solving 3D analysis tasks as a series of 2D tasks. K-fold cross-validation on the training set was performed by only 37% of the participants and only 50% of the participants performed ensembling based on multiple identical models (61%) or heterogeneous models (39%). 48% of the respondents applied postprocessing steps.
translated by 谷歌翻译
Variational autoencoders (VAEs) are powerful tools for learning latent representations of data used in a wide range of applications. In practice, VAEs usually require multiple training rounds to choose the amount of information the latent variable should retain. This trade-off between the reconstruction error (distortion) and the KL divergence (rate) is typically parameterized by a hyperparameter $\beta$. In this paper, we introduce Multi-Rate VAE (MR-VAE), a computationally efficient framework for learning optimal parameters corresponding to various $\beta$ in a single training run. The key idea is to explicitly formulate a response function that maps $\beta$ to the optimal parameters using hypernetworks. MR-VAEs construct a compact response hypernetwork where the pre-activations are conditionally gated based on $\beta$. We justify the proposed architecture by analyzing linear VAEs and showing that it can represent response functions exactly for linear VAEs. With the learned hypernetwork, MR-VAEs can construct the rate-distortion curve without additional training and can be deployed with significantly less hyperparameter tuning. Empirically, our approach is competitive and often exceeds the performance of multiple $\beta$-VAEs training with minimal computation and memory overheads.
translated by 谷歌翻译
预训练模型已在许多代码智能任务中有效。这些模型在大规模未标记的语料库中进行了预训练,然后在下游任务中进行了微调。但是,由于预训练和下游任务的输入是不同的形式,因此很难充分探索预训练模型的知识。此外,微调的性能强烈依赖于下游数据的量,而实际上,具有稀缺数据的场景很常见。自然语言处理(NLP)领域的最新研究表明,迅速调整,一种调整的新范式,减轻上述问题并在各种NLP任务中实现了有希望的结果。在迅速调整中,在调整过程中插入的提示提供了特定于任务的知识,这对于具有相对较少数据的任务特别有益。在本文中,我们凭经验评估了代码智能任务中迅速调整的用法和效果。我们对流行的预训练模型Codebert和codet5进行及时调整,并尝试三个代码智能任务,包括缺陷预测,代码摘要和代码翻译。我们的实验结果表明,在所有三个任务中,迅速调整始终优于微调。此外,及时调整在低资源场景中显示出很大的潜力,例如,对于代码摘要,平均将微调的BLEU分数提高了26%以上。我们的结果表明,我们可以调整代码智能任务的迅速调整,以实现更好的性能,尤其是在缺乏特定于任务的数据时,我们可以调整及时调整。
translated by 谷歌翻译
我们介绍了Proteus,这是一种新型的自设计近似范围滤波器,它根据采样数据进行配置,以便针对给定的空间要求优化其误报率(FPR)。 Proteus统一了最先进的范围过滤器的概率和确定性设计空间,以在较大的用例中实现稳健的性能。 Proteus的核心是我们的上下文前缀FPR(CPFPR)模型 - 在其设计空间中基于前缀过滤器的FPR的正式框架。我们从经验上证明了模型和Proteus在合成工作负载和现实世界数据集上优化的能力的准确性。我们进一步评估了RockSDB中的Proteus,并表明它能够将端到端的性能提高到5.3倍,而不是更脆的先进方法,例如Surf和Rosetta。我们的实验还表明,与端到端的性能增长相比,建模的成本并不显着,并且Proteus对工作负载转移的稳定性。
translated by 谷歌翻译
通常通过过去的选择来告知机器学习中的评估,例如要使用哪些数据集或指标。该标准化可以使用排行榜对平等基础进行比较,但是随着出现更好的替代方案,评估选择变得不佳。这个问题在自然语言生成中尤其相关,该语言需要不断改善的数据集,指标和人类评估以提出确定性的主张。为了使遵循最佳模型评估实践更加容易,我们介绍了GEMV2。新版本的一代,评估和指标基准为数据集,模型和指标开发人员提供了模块化基础架构,以使彼此受益。GEMV2支持40种记录的数据集中51种语言。所有数据集的模型都可以在线评估,我们的交互式数据卡创建和渲染工具使得在Living Benchmark中添加新数据集变得更加容易。
translated by 谷歌翻译
语言模型既展示了定量的改进,又展示了新的定性功能,随着规模的增加。尽管它们具有潜在的变革性影响,但这些新能力的特征却很差。为了为未来的研究提供信息,为破坏性的新模型能力做准备,并改善社会有害的效果,至关重要的是,我们必须了解目前和近乎未来的能力和语言模型的局限性。为了应对这一挑战,我们介绍了超越模仿游戏基准(Big Bench)。 Big Bench目前由204个任务组成,由132家机构的442位作者贡献。任务主题是多样的,从语言学,儿童发展,数学,常识性推理,生物学,物理学,社会偏见,软件开发等等。 Big-Bench专注于被认为超出当前语言模型的功能的任务。我们评估了OpenAI的GPT型号,Google内部密集变压器体系结构和大型基础上的开关稀疏变压器的行为,跨越了数百万到数十亿个参数。此外,一个人类专家评估者团队执行了所有任务,以提供强大的基准。研究结果包括:模型性能和校准都随规模改善,但绝对的术语(以及与评估者的性能相比);在模型类中的性能非常相似,尽管带有稀疏性。逐渐和预测的任务通常涉及大量知识或记忆成分,而在临界规模上表现出“突破性”行为的任务通常涉及多个步骤或组成部分或脆性指标;社交偏见通常会随着含糊不清的环境而随着规模而增加,但这可以通过提示来改善。
translated by 谷歌翻译
人工智能(AI)为简化Covid-19诊断提供了有前景的替代。然而,涉及周围的安全和可信度的担忧阻碍了大规模代表性的医学数据,对临床实践中训练广泛的模型造成了相当大的挑战。为了解决这个问题,我们启动了统一的CT-Covid AI诊断计划(UCADI),其中AI模型可以在没有数据共享的联合学习框架(FL)下在每个主机机构下分发和独立地在没有数据共享的情况下在每个主机机构上执行。在这里,我们认为我们的FL模型通过大的产量(中国测试敏感性/特异性:0.973 / 0.951,英国:0.730 / 0.942),与专业放射科医师的面板实现可比性表现。我们进一步评估了持有的模型(从另外两家医院收集,留出FL)和异构(用造影材料获取)数据,提供了模型所做的决策的视觉解释,并分析了模型之间的权衡联邦培训过程中的性能和沟通成本。我们的研究基于来自位于中国和英国的23家医院的3,336名患者的9,573次胸部计算断层扫描扫描(CTS)。统称,我们的工作提出了利用联邦学习的潜在保留了数字健康的前景。
translated by 谷歌翻译
The vast majority of successful deep neural networks are trained using variants of stochastic gradient descent (SGD) algorithms. Recent attempts to improve SGD can be broadly categorized into two approaches: (1) adaptive learning rate schemes, such as AdaGrad and Adam, and (2) accelerated schemes, such as heavy-ball and Nesterov momentum. In this paper, we propose a new optimization algorithm, Lookahead, that is orthogonal to these previous approaches and iteratively updates two sets of weights. Intuitively, the algorithm chooses a search direction by looking ahead at the sequence of "fast weights" generated by another optimizer. We show that Lookahead improves the learning stability and lowers the variance of its inner optimizer with negligible computation and memory cost. We empirically demonstrate Lookahead can significantly improve the performance of SGD and Adam, even with their default hyperparameter settings on ImageNet, CIFAR-10/100, neural machine translation, and Penn Treebank.
translated by 谷歌翻译
While the capabilities of autonomous systems have been steadily improving in recent years, these systems still struggle to rapidly explore previously unknown environments without the aid of GPS-assisted navigation. The DARPA Subterranean (SubT) Challenge aimed to fast track the development of autonomous exploration systems by evaluating their performance in real-world underground search-and-rescue scenarios. Subterranean environments present a plethora of challenges for robotic systems, such as limited communications, complex topology, visually-degraded sensing, and harsh terrain. The presented solution enables long-term autonomy with minimal human supervision by combining a powerful and independent single-agent autonomy stack, with higher level mission management operating over a flexible mesh network. The autonomy suite deployed on quadruped and wheeled robots was fully independent, freeing the human supervision to loosely supervise the mission and make high-impact strategic decisions. We also discuss lessons learned from fielding our system at the SubT Final Event, relating to vehicle versatility, system adaptability, and re-configurable communications.
translated by 谷歌翻译