Visual language such as charts and plots is ubiquitous in the human world. Comprehending plots and charts requires strong reasoning skills. Prior state-of-the-art (SOTA) models require at least tens of thousands of training examples and their reasoning capabilities are still much limited, especially on complex human-written queries. This paper presents the first one-shot solution to visual language reasoning. We decompose the challenge of visual language reasoning into two steps: (1) plot-to-text translation, and (2) reasoning over the translated text. The key in this method is a modality conversion module, named as DePlot, which translates the image of a plot or chart to a linearized table. The output of DePlot can then be directly used to prompt a pretrained large language model (LLM), exploiting the few-shot reasoning capabilities of LLMs. To obtain DePlot, we standardize the plot-to-table task by establishing unified task formats and metrics, and train DePlot end-to-end on this task. DePlot can then be used off-the-shelf together with LLMs in a plug-and-play fashion. Compared with a SOTA model finetuned on more than >28k data points, DePlot+LLM with just one-shot prompting achieves a 24.0% improvement over finetuned SOTA on human-written queries from the task of chart QA.
translated by 谷歌翻译
Recently, there has been significant progress in teaching language models to perform step-by-step reasoning to solve complex numerical reasoning tasks. Chain-of-thoughts prompting (CoT) is by far the state-of-art method for these tasks. CoT uses language models to perform both reasoning and computation in the multi-step `thought' process. To disentangle computation from reasoning, we propose `Program of Thoughts' (PoT), which uses language models (mainly Codex) to express the reasoning process as a program. The computation is relegated to an external computer, which executes the generated programs to derive the answer. We evaluate PoT on five math word problem datasets (GSM, AQuA, SVAMP, TabMWP, MultiArith) and three financial-QA datasets (FinQA, ConvFinQA, TATQA) for both few-shot and zero-shot setups. Under both few-shot and zero-shot settings, PoT can show an average performance gain over CoT by around 12\% across all the evaluated datasets. By combining PoT with self-consistency decoding, we can achieve SoTA performance on all math problem datasets and near-SoTA performance on financial datasets. All of our data and code are released in Github\footnote{\url{https://github.com/wenhuchen/Program-of-Thoughts}}.
translated by 谷歌翻译
关于文本到图像生成的研究在产生多样化和照片现实的图像方面取得了重大进展,这是由在大规模图像文本数据上训练的扩散和自动回归模型驱动的。尽管最先进的模型可以产生共同实体的高质量图像,但它们通常很难产生不常见的实体的图像,例如“ chortai(dog)”或“ picarones(食物)”。为了解决此问题,我们介绍了检索型的文本对图像生成器(Re-Imagen),这是一种生成模型,它使用检索到的信息来产生高保真和忠实的图像,即使对于稀有或看不见的实体也是如此。给定文本提示,重新构造访问外部多模式知识库以检索相关(图像,文本)对,并将它们用作引用来生成图像。通过此检索步骤,重新构造的知识是对上述实体的高级语义和低级视觉细节的了解,从而提高了其在产生实体视觉外观的准确性。我们在包含(图像,文本,检索)的构造数据集上训练Re-Imagen,以教导该模型在文本提示和检索上扎根。此外,我们制定了一种新的抽样策略,以使文本和检索条件的无分类指南交流,以平衡文本和检索对齐。 Re-Imagen在两个图像生成基准上获得了新的SOTA FID结果,例如Coco(IE,FID = 5.25)和Wikiimage(即FID = 5.82),而无需微调。为了进一步评估该模型的功能,我们介绍了EntityDrawBench,这是一种新的基准测试,可评估从多个视觉域的各种实体的图像生成,从频繁到稀有。人类对EntityDrawBench的评估表明,Re-Imagen与照片现实主义中最好的先前模型相同,但具有明显的忠诚,尤其是在较不频繁的实体上。
translated by 谷歌翻译
在该职位论文中,我们提出了一种新方法,以基于问题的产生和实体链接来生成文本的知识库(KB)。我们认为,所提出的KB类型具有传统符号KB的许多关键优势:尤其是由小型模块化组件组成,可以在组合上合并以回答复杂的查询,包括涉及“多跳跃”的关系查询和查询。“推论。但是,与传统的KB不同,该信息商店与常见的用户信息需求相符。
translated by 谷歌翻译
在宣传,新闻和社交媒体中的虚假,不准确和误导信息中,现实世界的问题应答(QA)系统面临综合和推理相互矛盾的挑战,以获得正确答案的挑战。这种紧迫性导致需要使QA系统对错误信息的强大,这是一个先前未开发的主题。我们通过调查与实际和虚假信息混合的矛盾的情况下,通过调查QA模型的行为来研究对QA模型的错误信息的风险。我们为此问题创建了第一个大规模数据集,即对QA,其中包含超过10K的人写和模型生成的矛盾的上下文。实验表明,QA模型易受误导的背景下的攻击。为了防御这种威胁,我们建立一个错误信息感知的QA系统作为一个反措施,可以以联合方式整合问题应答和错误信息检测。
translated by 谷歌翻译
Time series forecasting is an important problem across many domains, including predictions of solar plant energy output, electricity consumption, and traffic jam situation. In this paper, we propose to tackle such forecasting problem with Transformer [1]. Although impressed by its performance in our preliminary study, we found its two major weaknesses: (1) locality-agnostics: the point-wise dotproduct self-attention in canonical Transformer architecture is insensitive to local context, which can make the model prone to anomalies in time series; (2) memory bottleneck: space complexity of canonical Transformer grows quadratically with sequence length L, making directly modeling long time series infeasible. In order to solve these two issues, we first propose convolutional self-attention by producing queries and keys with causal convolution so that local context can be better incorporated into attention mechanism. Then, we propose LogSparse Transformer with only O(L(log L) 2 ) memory cost, improving forecasting accuracy for time series with fine granularity and strong long-term dependencies under constrained memory budget. Our experiments on both synthetic data and realworld datasets show that it compares favorably to the state-of-the-art.
translated by 谷歌翻译
Deep learning models can achieve high accuracy when trained on large amounts of labeled data. However, real-world scenarios often involve several challenges: Training data may become available in installments, may originate from multiple different domains, and may not contain labels for training. Certain settings, for instance medical applications, often involve further restrictions that prohibit retention of previously seen data due to privacy regulations. In this work, to address such challenges, we study unsupervised segmentation in continual learning scenarios that involve domain shift. To that end, we introduce GarDA (Generative Appearance Replay for continual Domain Adaptation), a generative-replay based approach that can adapt a segmentation model sequentially to new domains with unlabeled data. In contrast to single-step unsupervised domain adaptation (UDA), continual adaptation to a sequence of domains enables leveraging and consolidation of information from multiple domains. Unlike previous approaches in incremental UDA, our method does not require access to previously seen data, making it applicable in many practical scenarios. We evaluate GarDA on two datasets with different organs and modalities, where it substantially outperforms existing techniques.
translated by 谷歌翻译
The development of social media user stance detection and bot detection methods rely heavily on large-scale and high-quality benchmarks. However, in addition to low annotation quality, existing benchmarks generally have incomplete user relationships, suppressing graph-based account detection research. To address these issues, we propose a Multi-Relational Graph-Based Twitter Account Detection Benchmark (MGTAB), the first standardized graph-based benchmark for account detection. To our knowledge, MGTAB was built based on the largest original data in the field, with over 1.55 million users and 130 million tweets. MGTAB contains 10,199 expert-annotated users and 7 types of relationships, ensuring high-quality annotation and diversified relations. In MGTAB, we extracted the 20 user property features with the greatest information gain and user tweet features as the user features. In addition, we performed a thorough evaluation of MGTAB and other public datasets. Our experiments found that graph-based approaches are generally more effective than feature-based approaches and perform better when introducing multiple relations. By analyzing experiment results, we identify effective approaches for account detection and provide potential future research directions in this field. Our benchmark and standardized evaluation procedures are freely available at: https://github.com/GraphDetec/MGTAB.
translated by 谷歌翻译
As one of the prevalent methods to achieve automation systems, Imitation Learning (IL) presents a promising performance in a wide range of domains. However, despite the considerable improvement in policy performance, the corresponding research on the explainability of IL models is still limited. Inspired by the recent approaches in explainable artificial intelligence methods, we proposed a model-agnostic explaining framework for IL models called R2RISE. R2RISE aims to explain the overall policy performance with respect to the frames in demonstrations. It iteratively retrains the black-box IL model from the randomized masked demonstrations and uses the conventional evaluation outcome environment returns as the coefficient to build an importance map. We also conducted experiments to investigate three major questions concerning frames' importance equality, the effectiveness of the importance map, and connections between importance maps from different IL models. The result shows that R2RISE successfully distinguishes important frames from the demonstrations.
translated by 谷歌翻译
Compressed videos often exhibit visually annoying artifacts, known as Perceivable Encoding Artifacts (PEAs), which dramatically degrade video visual quality. Subjective and objective measures capable of identifying and quantifying various types of PEAs are critical in improving visual quality. In this paper, we investigate the influence of four spatial PEAs (i.e. blurring, blocking, bleeding, and ringing) and two temporal PEAs (i.e. flickering and floating) on video quality. For spatial artifacts, we propose a visual saliency model with a low computational cost and higher consistency with human visual perception. In terms of temporal artifacts, self-attention based TimeSFormer is improved to detect temporal artifacts. Based on the six types of PEAs, a quality metric called Saliency-Aware Spatio-Temporal Artifacts Measurement (SSTAM) is proposed. Experimental results demonstrate that the proposed method outperforms state-of-the-art metrics. We believe that SSTAM will be beneficial for optimizing video coding techniques.
translated by 谷歌翻译