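Selective classification allows models to abstain from making predictions (e.g., say "I don't know") when in doubt, in order to obtain better effective accuracy. While typical selective models can be effective at producing more accurate predictions on average, they may still allow wrong predictions that have high confidence, or skip correct predictions that have low confidence. Providing calibrated uncertainty estimates alongside predictions (probabilities that correspond to true frequencies) can be as important as having predictions that are accurate on average. However, uncertainty estimates can be unreliable for certain inputs. In this paper, we develop a new approach to selective classification in which we propose a method for rejecting examples with "uncertain" uncertainties. By doing so, we aim to make predictions with well-calibrated uncertainty estimates over the distribution of accepted examples, a property we call selective calibration. We present a framework for learning selectively calibrated models, in which a separate selector network is trained to improve the selective calibration error of a given base model. In particular, our work focuses on achieving robust calibration, where the model is intentionally designed to be tested on out-of-domain data. We achieve this through a training strategy inspired by distributionally robust optimization, in which we apply simulated input perturbations to the known, in-domain training data. We demonstrate the empirical effectiveness of our approach on multiple image classification and lung cancer risk assessment tasks.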
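As a rough illustration of the quantities this abstract discusses, the sketch below abstains whenever a confidence score falls under a fixed threshold, then reports coverage, accuracy on the accepted examples, and a one-bin calibration gap (mean confidence minus accuracy, in absolute value). The fixed threshold stands in for the learned selector network and is an assumption made here for brevity; the function name and the toy data are hypothetical.

```python
def selective_metrics(confidences, correct, threshold):
    """Abstain below `threshold`; return (coverage, accuracy, calibration_gap).

    `correct` holds 1 for a correct prediction and 0 otherwise. The gap is a
    single-bin simplification of a calibration error, not the paper's metric.
    """
    accepted = [(p, c) for p, c in zip(confidences, correct) if p >= threshold]
    coverage = len(accepted) / len(confidences)
    if not accepted:
        return coverage, float("nan"), float("nan")
    accuracy = sum(c for _, c in accepted) / len(accepted)
    mean_confidence = sum(p for p, _ in accepted) / len(accepted)
    return coverage, accuracy, abs(mean_confidence - accuracy)


# Toy data (hypothetical): five predictions with their confidences.
confidences = [0.95, 0.80, 0.60, 0.55, 0.99]
correct = [1, 1, 0, 1, 1]
coverage, accuracy, gap = selective_metrics(confidences, correct, threshold=0.7)
```

With this toy data the rule skips one correct low-confidence prediction, which is the kind of behavior the abstract points out as a weakness of purely confidence-based selection.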
translated by 谷歌翻译
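We extend conformal prediction to control the expected value of any monotone loss function. The algorithm generalizes conformal prediction together with its coverage guarantee. Like conformal prediction, the conformal risk control procedure is tight up to an $\mathcal{O}(1/n)$ factor. Worked examples from computer vision and natural language processing demonstrate the use of our algorithm to control the false negative rate, graph distance, and token-level F1 score.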
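The threshold selection behind conformal risk control can be sketched as follows, assuming a loss bounded by `bound` that is non-increasing in the parameter `lam`: pick the smallest `lam` on a grid such that n/(n+1) times the empirical risk, plus bound/(n+1), is at most `alpha`. The multi-label false-negative loss, the grid, and the toy calibration data are assumptions chosen for illustration.

```python
def false_negative_proportion(scores, true_labels, lam):
    """Fraction of true labels missing from the set {k : score_k >= 1 - lam}."""
    predicted = {k for k, s in enumerate(scores) if s >= 1.0 - lam}
    return 1.0 - len(true_labels & predicted) / len(true_labels)


def conformal_risk_threshold(calibration, loss, lambdas, alpha, bound=1.0):
    """Smallest grid value whose inflated empirical risk is at most alpha."""
    n = len(calibration)
    for lam in sorted(lambdas):
        risk = sum(loss(scores, labels, lam) for scores, labels in calibration) / n
        if (n / (n + 1)) * risk + bound / (n + 1) <= alpha:
            return lam
    return max(lambdas)  # fall back to the most conservative setting


# Hypothetical calibration set: (per-class scores, set of true class indices).
calibration = [
    ([0.92, 0.21, 0.63], {0, 2}),
    ([0.12, 0.84, 0.33], {1}),
    ([0.71, 0.42, 0.55], {0, 1}),
    ([0.31, 0.35, 0.93], {2}),
]
grid = [k / 10 for k in range(11)]
lam_hat = conformal_risk_threshold(calibration, false_negative_proportion, grid, alpha=0.35)
```

Because the loss only shrinks as `lam` grows, scanning the grid in ascending order and stopping at the first feasible value returns the least conservative setting that still meets the risk target.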
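Recent advances in Transformer-based large language models (LLMs) have led to significant performance improvements across many tasks. These gains come with a drastic increase in model size, potentially leading to slow and costly use at inference time. In practice, however, the series of generations made by LLMs is composed of varying levels of difficulty. While certain predictions truly benefit from the model's full capacity, other continuations are more trivial and can be solved with reduced compute. In this work, we introduce Confident Adaptive Language Modeling (CALM), a framework for dynamically allocating different amounts of compute per input and generation timestep. Early-exit decoding involves several challenges that we address here, such as: (1) what confidence measure to use; (2) connecting sequence-level constraints to local per-token exit decisions; and (3) attending back to hidden representations that are missing due to early exits in previous tokens. Through theoretical analysis and empirical experiments on three diverse text generation tasks, we demonstrate the efficacy of our framework in reducing compute (a potential speedup of up to $\times 3$) while maintaining high performance.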
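A per-token early-exit decision of the kind described above might look like the following sketch. It assumes the confidence measure is the gap between the two largest softmax probabilities at each layer and that decoding stops at the first layer whose gap reaches a threshold; the function names, the threshold value, and the toy logits are hypothetical, and this is not presented as the framework's actual calibration procedure.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def early_exit_layer(per_layer_logits, threshold):
    """Return (layer, confidence) for the first layer that is confident enough.

    Layers are numbered from 1. If no layer reaches the threshold, the final
    layer is used. Expects at least one layer.
    """
    for layer, logits in enumerate(per_layer_logits, start=1):
        probs = sorted(softmax(logits), reverse=True)
        confidence = probs[0] - probs[1]  # gap between the top two classes
        if confidence >= threshold:
            return layer, confidence
    return len(per_layer_logits), confidence


# Hypothetical vocabulary logits produced after each of three layers.
per_layer_logits = [[1.0, 0.9, 0.8], [2.0, 0.5, 0.1], [5.0, 0.2, 0.1]]
layer, confidence = early_exit_layer(per_layer_logits, threshold=0.5)
```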
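Language models demonstrate both quantitative improvement and new qualitative capabilities with increasing scale. Despite their potentially transformative impact, these new capabilities are as yet poorly characterized. In order to inform future research, prepare for disruptive new model capabilities, and ameliorate socially harmful effects, it is vital that we understand the present and near-future capabilities and limitations of language models. To address this challenge, we introduce the Beyond the Imitation Game benchmark (BIG-bench). BIG-bench currently consists of 204 tasks, contributed by 442 authors across 132 institutions. Task topics are diverse, drawing problems from linguistics, childhood development, math, common-sense reasoning, biology, physics, social bias, software development, and beyond. BIG-bench focuses on tasks that are believed to be beyond the capabilities of current language models. We evaluate the behavior of OpenAI's GPT models, Google-internal dense transformer architectures, and Switch-style sparse transformers on BIG-bench, across model sizes spanning millions to hundreds of billions of parameters. In addition, a team of human expert raters performed all tasks in order to provide a strong baseline. Findings include: model performance and calibration both improve with scale, but are poor in absolute terms (and when compared with rater performance); performance is remarkably similar across model classes, though with benefits from sparsity; tasks that improve gradually and predictably commonly involve a large knowledge or memorization component, whereas tasks that exhibit "breakthrough" behavior at a critical scale often involve multiple steps or components, or brittle metrics; social bias typically increases with scale in settings with ambiguous context, but this can be improved with prompting.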
The recent GPT-3 model (Brown et al., 2020) achieves remarkable few-shot performance solely by leveraging a natural-language prompt and a few task demonstrations as input context. Inspired by their findings, we study few-shot learning in a more practical scenario, where we use smaller language models for which fine-tuning is computationally efficient. We present LM-BFF (better few-shot fine-tuning of language models; alternatively, language models' best friends forever), a suite of simple and complementary techniques for fine-tuning language models on a small number of annotated examples. Our approach includes (1) prompt-based fine-tuning together with a novel pipeline for automating prompt generation; and (2) a refined strategy for dynamically and selectively incorporating demonstrations into each context. Finally, we present a systematic evaluation for analyzing few-shot performance on a range of NLP tasks, including classification and regression. Our experiments demonstrate that our methods combine to dramatically outperform standard fine-tuning procedures in this low-resource setting, achieving up to 30% absolute improvement, and 11% on average across all tasks. Our approach makes minimal assumptions on task resources and domain expertise, and hence constitutes a strong task-agnostic method for few-shot learning. Our implementation is publicly available at https://github.com/princeton-nlp/LM-BFF. The first two authors contributed equally.
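To make the idea of a prompt with demonstrations concrete, here is a minimal sketch that renders an input sentence with a masked label slot and appends labeled demonstrations in the same template. The template string, the label words, and the function name are illustrative assumptions, not the prompts or API of the released implementation.

```python
MASK = "[MASK]"  # placeholder token the language model is asked to fill


def build_prompt(sentence, demonstrations, label_words,
                 template="{text} It was {word}."):
    """Render the query with a masked label, followed by filled-in demos."""
    parts = [template.format(text=sentence, word=MASK)]
    for demo_text, demo_label in demonstrations:
        parts.append(template.format(text=demo_text, word=label_words[demo_label]))
    return " ".join(parts)


# Hypothetical sentiment task with one demonstration per class.
label_words = {"positive": "great", "negative": "terrible"}
demonstrations = [("A fun ride.", "positive"), ("Dull and slow.", "negative")]
prompt = build_prompt("No reason to watch.", demonstrations, label_words)
```

In prompt-based fine-tuning, the model would then be trained to predict the label word at the masked position rather than a new classification head's output.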
This paper proposes to tackle open-domain question answering using Wikipedia as the unique knowledge source: the answer to any factoid question is a text span in a Wikipedia article. This task of machine reading at scale combines the challenges of document retrieval (finding the relevant articles) with that of machine comprehension of text (identifying the answer spans from those articles). Our approach combines a search component based on bigram hashing and TF-IDF matching with a multi-layer recurrent neural network model trained to detect answers in Wikipedia paragraphs. Our experiments on multiple existing QA datasets indicate that (1) both modules are highly competitive with respect to existing counterparts and (2) multitask learning using distant supervision on their combination is an effective complete system on this challenging task.
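The retrieval component can be approximated with a small sketch: unigrams and bigrams are hashed into a fixed number of buckets, and documents are ranked by a TF-IDF-weighted overlap with the query. The bucket count, the use of `zlib.crc32` as the hash function, the whitespace tokenizer, and the particular TF and IDF formulas are all assumptions made for this illustration.

```python
import math
import zlib
from collections import Counter

NUM_BUCKETS = 2 ** 20  # assumed size of the hashed feature space


def hashed_ngrams(text):
    """Count hashed unigram and bigram features of a whitespace-tokenized text."""
    tokens = text.lower().split()
    grams = tokens + [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    return Counter(zlib.crc32(g.encode("utf-8")) % NUM_BUCKETS for g in grams)


def rank_documents(query, documents):
    """Return document indices sorted by descending TF-IDF overlap with the query."""
    doc_vectors = [hashed_ngrams(d) for d in documents]
    # Document frequency: number of documents containing each bucket.
    doc_freq = Counter(bucket for vec in doc_vectors for bucket in vec)
    n_docs = len(documents)
    query_buckets = hashed_ngrams(query)
    scores = []
    for vec in doc_vectors:
        score = 0.0
        for bucket in query_buckets:
            if bucket in vec:
                idf = math.log(n_docs / doc_freq[bucket])
                score += math.log(1 + vec[bucket]) * idf
        scores.append(score)
    return sorted(range(n_docs), key=lambda i: scores[i], reverse=True)


documents = [
    "paris is the capital of france",
    "berlin is the capital of germany",
    "the eiffel tower is in paris",
]
ranking = rank_documents("capital of france", documents)
```

Hashing bigrams keeps some local word order ("of france" versus "of germany") without storing an explicit vocabulary, at the cost of occasional bucket collisions.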
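Most real-time human pose estimation approaches are based on detecting joint positions. Using the detected joint positions, the yaw and pitch of the limbs can be computed. However, the roll along the limb, which is critical for applications such as sports analysis and computer animation, cannot be computed because this axis of rotation remains unobserved. In this paper, we therefore introduce orientation keypoints, a novel approach for estimating the full position and rotation of skeletal joints using only single-frame RGB images. Inspired by how motion-capture systems use a set of point markers to estimate full bone rotations, our method uses virtual markers to generate sufficient information to accurately infer rotations with simple post-processing. The rotation predictions improve upon the best reported mean error for joint angles by 48% and achieve 93% accuracy across 15 bone rotations. The method also improves the current state-of-the-art results for joint positions by 14%, as measured by MPJPE on the principal dataset, and generalizes well to in-the-wild datasets.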
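For readers unfamiliar with the MPJPE metric mentioned above, it is the mean Euclidean distance between predicted and ground-truth joint positions. The sketch below computes it for a single pose; the two-joint toy skeleton is hypothetical, and any alignment step that a benchmark protocol may apply beforehand is omitted.

```python
import math


def mpjpe(predicted, ground_truth):
    """Mean per-joint position error between two lists of 3D joint coordinates."""
    if len(predicted) != len(ground_truth):
        raise ValueError("pose arrays must contain the same number of joints")
    distances = [math.dist(p, g) for p, g in zip(predicted, ground_truth)]
    return sum(distances) / len(distances)


# Hypothetical two-joint pose: the second joint is off by a 3-4-5 triangle.
predicted = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
ground_truth = [(0.0, 0.0, 0.0), (1.0, 3.0, 4.0)]
error = mpjpe(predicted, ground_truth)
```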
Quantitative cancer image analysis relies on the accurate delineation of tumours, a very specialised and time-consuming task. For this reason, methods for automated segmentation of tumours in medical imaging have been extensively developed in recent years, with Computed Tomography being one of the most popular imaging modalities explored. However, the large number of 3D voxels in a typical scan makes it prohibitive to analyse the entire volume at once on conventional hardware. To overcome this issue, downsampling and/or resampling are generally applied when using traditional convolutional neural networks in medical imaging. In this paper, we propose a new methodology that introduces a process of sparsification of the input images and submanifold sparse convolutional networks as an alternative to downsampling. As a proof of concept, we applied this new methodology to Computed Tomography images of renal cancer patients, obtaining segmentation performance for kidneys and tumours competitive with previous methods (~84.6% Dice similarity coefficient), while achieving a significant improvement in computation time (2-3 min per training epoch).
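The Dice similarity coefficient quoted above measures the overlap between a predicted and a reference segmentation mask: twice the number of shared foreground voxels divided by the total foreground count of both masks. A minimal sketch on flattened binary masks follows; returning 1.0 when both masks are empty is a convention assumed here, and the toy masks are hypothetical.

```python
def dice_coefficient(predicted_mask, reference_mask):
    """Dice overlap between two equally sized binary masks given as flat lists."""
    if len(predicted_mask) != len(reference_mask):
        raise ValueError("masks must have the same number of voxels")
    intersection = sum(p * r for p, r in zip(predicted_mask, reference_mask))
    total = sum(predicted_mask) + sum(reference_mask)
    if total == 0:
        return 1.0  # assumed convention: two empty masks agree perfectly
    return 2.0 * intersection / total


# Hypothetical four-voxel masks sharing one foreground voxel.
dice = dice_coefficient([1, 1, 0, 0], [1, 0, 1, 0])
```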
Multi-modal image-text models such as CLIP and LiT have demonstrated impressive performance on image classification benchmarks and their zero-shot generalization ability is particularly exciting. While the top-5 zero-shot accuracies of these models are very high, the top-1 accuracies are much lower (over 25% gap in some cases). We investigate the reasons for this performance gap and find that many of the failure cases are caused by ambiguity in the text prompts. First, we develop a simple and efficient zero-shot post-hoc method to identify images whose top-1 prediction is likely to be incorrect, by measuring consistency of the predictions w.r.t. multiple prompts and image transformations. We show that our procedure better predicts mistakes, outperforming the popular max logit baseline on selective prediction tasks. Next, we propose a simple and efficient way to improve accuracy on such uncertain images by making use of the WordNet hierarchy; specifically, we augment the original class by incorporating its parent and children from the semantic label hierarchy, and plug the augmentation into text prompts. We conduct experiments on both CLIP and LiT models with five different ImageNet-based datasets. For CLIP, our method improves the top-1 accuracy by 17.13% on the uncertain subset and 3.6% on the entire ImageNet validation set. We also show that our method improves across ImageNet shifted datasets and other model architectures such as LiT. Our proposed method is hyperparameter-free, requires no additional model training and can be easily scaled to other large multi-modal architectures.
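One simple way to picture a consistency check over prompts and image transformations is majority agreement among the resulting top-1 predictions, sketched below. Treating the agreement fraction as the consistency score, and flagging anything short of full agreement as uncertain, are assumptions made for this illustration rather than the exact rule used in the work; the class names are hypothetical.

```python
from collections import Counter


def consistency_score(predictions):
    """Return (majority_label, fraction of predictions that agree with it)."""
    label, count = Counter(predictions).most_common(1)[0]
    return label, count / len(predictions)


def is_uncertain(predictions):
    """Flag an image when its predictions are not unanimous (assumed rule)."""
    _, score = consistency_score(predictions)
    return score < 1.0


# Hypothetical top-1 labels for one image under four prompt/transform variants.
variant_predictions = ["tabby cat", "tabby cat", "tiger cat", "tabby cat"]
label, score = consistency_score(variant_predictions)
flagged = is_uncertain(variant_predictions)
```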
The internet has had a dramatic effect on the healthcare industry, allowing documents to be saved, shared, and managed digitally. This has made it easier to locate and share important data, improving patient care and providing more opportunities for medical studies. Because so much data is accessible to doctors and patients alike, summarizing it has become increasingly necessary. This need has been supported by the introduction of deep learning and transformer-based networks, which have advanced the field significantly in recent years. This paper gives a comprehensive survey of the current techniques and trends in medical summarization.