Mathematical reasoning is a fundamental aspect of human intelligence and is applicable in various fields, including science, engineering, finance, and everyday life. The development of artificial intelligence (AI) systems capable of solving math problems and proving theorems has garnered significant interest in the fields of machine learning and natural language processing. On the one hand, mathematics serves as a testbed for aspects of reasoning that are challenging for powerful deep learning models, driving new algorithmic and modeling advances. On the other hand, recent advances in large-scale neural language models have opened up new benchmarks and opportunities to use deep learning for mathematical reasoning. In this survey paper, we review the key tasks, datasets, and methods at the intersection of mathematical reasoning and deep learning over the past decade. We also evaluate existing benchmarks and methods, and discuss future research directions in this domain.
Neural models that do not rely on pre-training have excelled in the keyphrase generation task with large annotated datasets. Meanwhile, new approaches have incorporated pre-trained language models (PLMs) for their data efficiency. However, a systematic study of how the two types of approaches compare and how different design choices affect the performance of PLM-based models is still lacking. To fill this knowledge gap and facilitate a more informed use of PLMs for keyphrase extraction and keyphrase generation, we present an in-depth empirical study. Formulating keyphrase extraction as sequence labeling and keyphrase generation as sequence-to-sequence generation, we perform extensive experiments in three domains. After showing that PLMs have competitive high-resource performance and state-of-the-art low-resource performance, we investigate important design choices including in-domain PLMs, PLMs with different pre-training objectives, using PLMs with a parameter budget, and different formulations for present keyphrases. Further results show that (1) in-domain BERT-like PLMs can be used to build strong and data-efficient keyphrase generation models; (2) with a fixed parameter budget, prioritizing model depth over width and allocating more layers in the encoder leads to better encoder-decoder models; and (3) introducing four in-domain PLMs, we achieve competitive performance in the news domain and state-of-the-art performance in the scientific domain.
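As a hedged illustration of the sequence-labeling formulation mentioned above, the minimal sketch below tags subwords with a BIO scheme using a generic BERT checkpoint; the checkpoint name, label set, and span-decoding loop are illustrative stand-ins, not the paper's exact setup (in particular, not its in-domain PLMs).

```python
# Minimal sketch: keyphrase extraction as BIO sequence labeling with a PLM.
# Assumes the Hugging Face `transformers` library; checkpoint and labels
# are placeholders, not the paper's in-domain models.
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

labels = ["O", "B-KP", "I-KP"]  # outside / begin / inside a keyphrase
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-uncased", num_labels=len(labels)
)

text = "We study pre-trained language models for keyphrase generation."
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits        # (1, seq_len, 3)
pred = logits.argmax(-1)[0]                # one BIO tag per subword

# Decode contiguous B/I spans back into keyphrase strings.
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
spans, current = [], []
for tok, tag in zip(tokens, pred.tolist()):
    if labels[tag] == "B-KP":
        if current:
            spans.append(current)
        current = [tok]
    elif labels[tag] == "I-KP" and current:
        current.append(tok)
    else:
        if current:
            spans.append(current)
        current = []
if current:
    spans.append(current)
print([tokenizer.convert_tokens_to_string(s) for s in spans])
```

With an untrained classification head the predicted tags are random; the point is only to show the formulation, which is then trained with standard token-level cross-entropy.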
Privacy policies provide individuals with information about their rights and how their personal information is handled. Natural language understanding (NLU) technologies can help individuals and practitioners better understand the privacy practices described in lengthy and complex documents. However, existing efforts that use NLU technologies are limited in that they process the language in a way tailored to a single task focused on certain privacy practices. To this end, we introduce the Privacy Policy Language Understanding Evaluation (PLUE) benchmark, a multi-task benchmark for evaluating privacy policy language understanding across various tasks. We also collect a large corpus of privacy policies to enable privacy-policy domain-specific language model pre-training. We demonstrate that domain-specific pre-training offers performance improvements across all tasks. We release the benchmark to encourage future research in this domain.
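A minimal sketch of the kind of domain-specific continued pre-training described above, assuming the Hugging Face `transformers` and `datasets` libraries; the corpus file, base checkpoint, and hyperparameters are placeholders, not the PLUE authors' exact recipe.

```python
# Hedged sketch: continue masked-language-model pre-training on a
# privacy-policy corpus, then fine-tune the checkpoint on downstream tasks.
from datasets import load_dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

# Hypothetical corpus: one privacy-policy passage per line.
corpus = load_dataset("text", data_files={"train": "privacy_policies.txt"})
tokenized = corpus["train"].map(
    lambda ex: tokenizer(ex["text"], truncation=True, max_length=512),
    batched=True, remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="plm-privacy", num_train_epochs=1,
                           per_device_train_batch_size=8),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer,
                                                  mlm_probability=0.15),
)
trainer.train()  # the adapted checkpoint is then fine-tuned per PLUE task
```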
From a visual scene containing multiple people, humans are able to distinguish each individual given context descriptions about what happened before, their mental/physical states, their intentions, etc. This ability relies heavily on human-centric commonsense knowledge and reasoning. For example, if asked to identify the "person who needs healing" in an image, we first need to know that such people usually have injuries or pained expressions, then find the corresponding visual clues before finally grounding the person. We present a new commonsense task, Human-centric Commonsense Grounding, that tests models' ability to ground individuals given context descriptions about what happened before and their mental/physical states or intentions. We further create a benchmark, HumanCog, a dataset with 130k grounded commonsensical descriptions annotated on 67k images, covering diverse types of commonsense and visual scenes. We set up a context-object-aware method as a strong baseline that outperforms previous pre-trained and non-pretrained models. Further analysis demonstrates that rich visual commonsense and powerful integration of multi-modal commonsense are essential, which sheds light on future work. Data and code will be available at https://github.com/Hxyou/HumanCog.
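To make the grounding task concrete, here is an illustrative baseline sketch, not the paper's context-object-aware method: score each candidate person box against the context description with a pre-trained image-text encoder and ground the description to the best-matching person. The image path and detection boxes are hypothetical placeholders.

```python
# Illustrative sketch: CLIP-style scoring of person crops vs. a context
# description. Not the HumanCog baseline model.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

description = "the person who needs healing"
image = Image.open("scene.jpg")                           # placeholder image
person_boxes = [(10, 40, 120, 300), (200, 30, 330, 310)]  # hypothetical detections

crops = [image.crop(box) for box in person_boxes]
inputs = processor(text=[description], images=crops,
                   return_tensors="pt", padding=True)
with torch.no_grad():
    out = model(**inputs)
# logits_per_text: similarity of the description to every person crop
best = out.logits_per_text.argmax(-1).item()
print("grounded person box:", person_boxes[best])
```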
In this paper, we propose an end-to-end Retrieval-Augmented Visual Language Model (REVEAL) that learns to encode world knowledge into a large-scale memory, and to retrieve from it to answer knowledge-intensive queries. REVEAL consists of four key components: the memory, the encoder, the retriever, and the generator. The large-scale memory encodes various sources of multimodal world knowledge (e.g., image-text pairs, question-answer pairs, knowledge graph triplets) via a unified encoder. The retriever finds the most relevant knowledge entries in the memory, and the generator fuses the retrieved knowledge with the input query to produce the output. A key novelty in our approach is that the memory, encoder, retriever, and generator are all pre-trained end-to-end on a massive amount of data. Furthermore, our approach can use a diverse set of multimodal knowledge sources, which is shown to result in significant gains. We show that REVEAL achieves state-of-the-art results on visual question answering and image captioning.
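A schematic sketch of the retrieve-then-fuse flow described above; the encoders and memory here are toy tensor stand-ins, not the REVEAL implementation.

```python
# Schematic retrieve-then-fuse skeleton: embed the query, find the top-k
# memory entries by similarity, and hand [query; knowledge] to a generator.
import torch
import torch.nn.functional as F

d, memory_size, k = 256, 10_000, 5
memory_keys = torch.randn(memory_size, d)    # unified-encoder embeddings of
memory_values = torch.randn(memory_size, d)  # multimodal knowledge entries

def retrieve(query_emb: torch.Tensor, k: int) -> torch.Tensor:
    """Return the k most relevant memory entries by cosine similarity."""
    scores = F.cosine_similarity(query_emb.unsqueeze(0), memory_keys, dim=-1)
    topk = scores.topk(k).indices
    return memory_values[topk]

query_emb = torch.randn(d)                   # stand-in for the encoded query
knowledge = retrieve(query_emb, k)           # (k, d)
# Fuse: concatenate retrieved entries with the query; in the actual model an
# attention-based generator consumes this sequence to produce the answer.
generator_input = torch.cat([query_emb.unsqueeze(0), knowledge], dim=0)
print(generator_input.shape)                 # (k + 1, d)
```

End-to-end pre-training means the retrieval scores themselves receive gradient signal from the generator's loss, rather than being fixed by a frozen retriever.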
Visual commonsense understanding requires Vision-Language (VL) models to not only understand image and text but also cross-reference between the two to fully integrate and comprehend the visual scene described. Recently, various approaches have been developed and have achieved high performance on visual commonsense benchmarks. However, it is unclear whether the models really understand the visual scene and the underlying commonsense knowledge, due to limited evaluation data resources. To provide an in-depth analysis, we present a Multimodal Evaluation (ME) pipeline to automatically generate question-answer pairs to test models' understanding of the visual scene, text, and related knowledge. We then take a step further to show that training with the ME data boosts the model's performance in standard VCR evaluation. Lastly, our in-depth analysis and comparison reveal interesting findings: (1) semantically low-level information can assist the learning of high-level information but not the opposite; (2) visual information is generally underutilized compared with text.
Mathematical reasoning, a core ability of human intelligence, poses unique challenges for machines in abstract thinking and logical reasoning. Recent large pre-trained language models such as GPT-3 have achieved remarkable progress on mathematical reasoning tasks written in text form, such as math word problems (MWPs). However, it is unknown whether these models can handle more complex problems that involve mathematical reasoning over richer inputs such as tabular data. To fill this gap, we present Tabular Math Word Problems (TabMWP), a new dataset containing 38,431 open-domain grade-level problems that require mathematical reasoning over both textual and tabular data. Each question in TabMWP is aligned with a tabular context, which is presented as an image, semi-structured text, and a structured table. There are two types of questions, free-text and multiple-choice, and each is annotated with a gold solution that reveals the multi-step reasoning process. We evaluate different pre-trained models on TabMWP, including GPT-3 in the few-shot setting. As earlier studies suggest, because few-shot GPT-3 relies on the selection of in-context examples, its performance is unstable and can degrade to near chance. This instability is more severe when handling complex problems like those in TabMWP. To mitigate it, we further propose a novel approach, PromptPG, which uses policy gradient learning to select in-context examples from a small amount of training data and then constructs the corresponding prompt for each test example. Experimental results show that our method outperforms the best baseline on the accuracy metric and significantly reduces the prediction variance compared with random selection, which verifies its effectiveness in selecting in-context examples.
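A hedged sketch of the policy-gradient idea behind PromptPG: a small policy scores candidate training examples, a few are sampled to build the prompt, and REINFORCE updates the policy from the answer's correctness. The embeddings, the reward function, and the LM call are stubbed out for illustration and are not the paper's implementation.

```python
# REINFORCE-style in-context example selection (illustrative sketch).
import torch

n_candidates, d, k = 20, 128, 2
policy = torch.nn.Linear(d, 1)                 # scores each candidate example
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

def run_lm_and_score(chosen_ids) -> float:
    """Stub: build a prompt from the chosen examples, query the LM on the
    problem, and return 1.0 if the predicted answer is correct else 0.0."""
    return float(torch.rand(()) > 0.5)

candidate_embs = torch.randn(n_candidates, d)  # stand-in problem embeddings
for step in range(100):
    logits = policy(candidate_embs).squeeze(-1)
    dist = torch.distributions.Categorical(logits=logits)
    chosen = dist.sample((k,))                 # pick k in-context examples
    reward = run_lm_and_score(chosen.tolist())
    loss = -(dist.log_prob(chosen).sum() * reward)   # REINFORCE objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

This toy version samples with replacement and uses a binary reward; the key point is that example selection becomes a learnable decision rather than random choice, which is what reduces the prediction variance.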
We present a method for one-shot image synthesis that can control manipulations of a single image by inverting a quasi-robust classifier equipped with strong regularizers. Our proposed method, entitled MAGIC, leverages structured gradients from a pre-trained quasi-robust classifier to better preserve the input semantics while maintaining its classification accuracy, thereby guaranteeing credibility in the synthesis. Unlike current methods that use complex primitives to supervise the process or use attention maps as a weak supervisory signal, MAGIC aggregates gradients over the input, driven by a guide binary mask that enforces a strong spatial prior. MAGIC implements a series of manipulations within a single framework, achieving shape and location control, intense non-rigid shape deformation, and copy/move operations in the presence of repeating objects, and gives users firm control over the synthesis by requiring them to simply specify a binary guide mask. Our study and findings are supported by extensive qualitative comparisons with the state of the art on the same images sampled from ImageNet, quantitative analysis using machine perception, and a user survey of more than 100 participants that endorses our synthesis quality.
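A schematic sketch of mask-guided gradient inversion in the spirit of the method above; the classifier, loss, mask region, and target class below are placeholders, not the paper's quasi-robust model or its regularizers.

```python
# Illustrative sketch: optimize the input image with classifier gradients,
# restricted spatially by a binary guide mask.
import torch
import torchvision.models as models

classifier = models.resnet50(weights=None).eval()  # stand-in classifier
image = torch.rand(1, 3, 224, 224, requires_grad=True)
guide_mask = torch.zeros(1, 1, 224, 224)           # binary spatial prior
guide_mask[..., 64:160, 64:160] = 1.0              # hypothetical edit region
target = torch.tensor([207])                       # class to preserve

opt = torch.optim.Adam([image], lr=0.01)
for step in range(50):
    logits = classifier(image)
    loss = torch.nn.functional.cross_entropy(logits, target)
    opt.zero_grad()
    loss.backward()
    image.grad *= guide_mask                       # apply gradients only
    opt.step()                                     # inside the guide mask
```

Masking the gradient rather than the pixels is what lets the mask act as a spatial prior: updates concentrate in the edit region while the rest of the image, and the classifier's prediction, stay largely intact.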
When answering a question, humans leverage the information available across different modalities to synthesize a consistent and complete chain of thought (CoT). This process is usually a black box in the case of deep learning models such as large-scale language models. Recently, science question benchmarks have been used to diagnose the multi-hop reasoning ability and interpretability of AI systems. However, existing datasets fail to provide annotations for the answers, or are restricted to the text-only modality, small scales, and limited domain diversity. To this end, we present Science Question Answering (SQA), a new benchmark consisting of ~21k multimodal multiple-choice questions covering a diverse set of science topics, with annotations of their answers and the corresponding lectures and explanations. We further design language models to learn to generate lectures and explanations as a chain of thought (CoT) to mimic the multi-hop reasoning process when answering SQA questions. SQA demonstrates the utility of CoT in language models, as CoT improves question answering performance by 1.20% for GPT-3 and 3.99% for UnifiedQA. We also explore the upper bound for models to leverage explanations by feeding them into the input; we observe that this improves the few-shot performance of GPT-3 by 18.96%. Our analysis further shows that, similar to humans, language models benefit from explanations, learning from less data and achieving the same performance with just 40% of the data.
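To illustrate the lecture-and-explanation chain-of-thought format described above, here is a minimal few-shot prompt sketch; the question text and option wording are made-up placeholders, not actual SQA items.

```python
# Illustrative CoT prompt: the demonstration includes a lecture and an
# explanation before the answer, and the model is asked to continue the
# same pattern for the test question.
demo = (
    "Question: Which property matches a rubber band being pulled?\n"
    "Options: (A) hard (B) stretchy\n"
    "Lecture: Every object has properties; a stretchy material returns to "
    "its shape after being pulled.\n"
    "Explanation: A rubber band stretches when pulled and snaps back, "
    "so it is stretchy.\n"
    "Answer: (B)\n\n"
)
test = (
    "Question: Which property matches a glass marble?\n"
    "Options: (A) hard (B) stretchy\n"
    "Lecture:"  # the model continues: lecture, explanation, then the answer
)
prompt = demo + test
print(prompt)  # sent to GPT-3 (or another LM) for few-shot CoT inference
```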
Existing studies have demonstrated that adversarial examples can be directly attributed to the presence of non-robust features, which are highly predictive but can easily be manipulated by adversaries to fool NLP models. In this study, we explore the feasibility of capturing task-specific robust features while eliminating the non-robust ones using information bottleneck theory. Through extensive experiments, we show that models trained with our information-bottleneck-based method achieve a significant improvement in robust accuracy, exceeding the performance of all previously reported defense methods, while suffering almost no degradation in clean accuracy on the SST-2, AGNews, and IMDB datasets.
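A minimal sketch of a variational information-bottleneck objective for text classification, in the spirit of the approach above: a stochastic encoding z keeps task-relevant (robust) features while a KL term compresses away the rest. The architecture sizes and the beta value are illustrative assumptions, not the paper's configuration.

```python
# Variational IB loss sketch: cross-entropy keeps I(Z; Y) high, while a
# KL term to a standard normal prior upper-bounds I(Z; X).
import torch
import torch.nn.functional as F

d_in, d_z, n_classes, beta = 768, 128, 2, 1e-3
enc_mu = torch.nn.Linear(d_in, d_z)
enc_logvar = torch.nn.Linear(d_in, d_z)
clf = torch.nn.Linear(d_z, n_classes)

def ib_loss(h: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    """h: sentence encodings (e.g. from a BERT encoder), y: labels."""
    mu, logvar = enc_mu(h), enc_logvar(h)
    z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterize
    task = F.cross_entropy(clf(z), y)                     # predictive term
    # KL(q(z|x) || N(0, I)): compress nuisance / non-robust features
    kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(-1).mean()
    return task + beta * kl

h = torch.randn(16, d_in)                # stand-in batch of encodings
y = torch.randint(0, n_classes, (16,))
print(ib_loss(h, y).item())
```

The trade-off coefficient beta controls how aggressively the bottleneck discards features; too large and clean accuracy suffers, too small and non-robust features survive.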