In recent years, low-resource machine reading comprehension (MRC) has made significant progress, with models achieving remarkable performance on datasets in various languages. However, none of these models have been tailored to the Urdu language. This work explores the semi-automated creation of the Urdu Question Answering Dataset (UQuAD1.0) by combining machine-translated SQuAD with human-generated samples derived from Wikipedia articles and Urdu RC worksheets from Cambridge O-level books. UQuAD1.0 is a large-scale Urdu dataset for the extractive machine reading comprehension task, consisting of 49k question-answer pairs in question, passage, and answer format. In UQuAD1.0, 45,000 QA pairs were produced by machine translation of the original SQuAD1.0 and approximately 4,000 pairs via crowdsourcing. In this study, we used two types of MRC models: a rule-based baseline and advanced transformer-based models. We found that the latter outperforms the former, so we focus on transformer-based architectures. Using XLM-RoBERTa and multilingual BERT, we obtain F1 scores of 0.66 and 0.63, respectively.
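As a concrete illustration of the extractive-MRC setup used in such experiments, here is a minimal inference sketch for a transformer span-extraction model, assuming the Hugging Face transformers API; the checkpoint name is a placeholder, since a model actually fine-tuned on UQuAD1.0 (not a released artifact of the paper) would be needed for meaningful answers.

```python
# Minimal sketch of extractive-MRC inference with an XLM-RoBERTa-style model.
# "xlm-roberta-base" is a placeholder: its QA head is untrained, so swap in a
# checkpoint fine-tuned for span extraction before expecting real answers.
import torch
from transformers import AutoModelForQuestionAnswering, AutoTokenizer

MODEL_NAME = "xlm-roberta-base"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForQuestionAnswering.from_pretrained(MODEL_NAME)

def extract_answer(question: str, passage: str) -> str:
    # Encode the (question, passage) pair; the model scores every token as a
    # potential span start and end, and the argmax positions define the answer.
    inputs = tokenizer(question, passage, return_tensors="pt", truncation=True)
    with torch.no_grad():
        outputs = model(**inputs)
    start = int(outputs.start_logits.argmax())
    end = int(outputs.end_logits.argmax())
    span_ids = inputs["input_ids"][0][start : end + 1]
    return tokenizer.decode(span_ids, skip_special_tokens=True)
```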
Question answering (QA) is one of the most challenging problems in natural language processing (NLP). QA systems attempt to produce answers to given questions; these answers can be generated from unstructured or structured text. QA is therefore considered an important research area that can be used to evaluate text-understanding systems. A large volume of QA research has been devoted to the English language, investigating the most advanced techniques and achieving state-of-the-art results. However, progress in Arabic question answering has been much slower, owing to the scarcity of research efforts in Arabic QA and the lack of large benchmark datasets. Recently, many pre-trained language models have delivered high performance on many Arabic NLP problems. In this work, we evaluate state-of-the-art pre-trained transformer models for Arabic QA using four reading comprehension datasets: Arabic-SQuAD, ARCD, AQAD, and TyDiQA-GoldP. We fine-tune and compare the performance of the AraBERTv2-base, AraBERTv0.2-large, and AraELECTRA models. Finally, we provide an analysis to understand and interpret the low performance obtained by some of the models.
Research on question answering datasets and models has received a great deal of attention in the research community, with many groups releasing their own question answering datasets as well as models, and tremendous progress has been made in this area. The aim of this survey is to identify, summarize, and analyze the existing datasets released by researchers, especially non-English datasets, along with resources such as research code and evaluation metrics. In this paper, we review question answering datasets available in English, French, German, Japanese, Chinese, Arabic, and Russian, as well as multilingual and cross-lingual question answering datasets.
Question answering (QA) is a natural language understanding task within the fields of information retrieval and information extraction that has attracted much attention from the computational linguistics and artificial intelligence research communities in recent years, owing to the strong development of models based on machine reading comprehension. A reader-based QA system is a high-level search engine that can find correct answers to queries or questions in open-domain or domain-specific texts using machine reading comprehension (MRC) techniques. Most advances in data resources and machine learning methods for MRC and QA systems have been made in resource-rich languages such as English and Chinese, while low-resource languages such as Vietnamese have seen scarce research on QA systems. This paper introduces XLMRQA, the first Vietnamese QA system using a transformer-based reader over a Wikipedia-based textual knowledge source (the UIT-ViQuAD corpus); it outperforms DrQA and BERTserini, two robust QA systems built on deep neural network models, by 24.46% and 6.28%, respectively. From the results obtained with the three systems, we analyze the influence of question type on QA system performance.
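To make the retriever-reader architecture described above concrete, here is a schematic, self-contained sketch of the two-stage pipeline; the word-overlap scoring is a toy stand-in for XLMRQA's actual dense retriever and transformer reader, which are not reproduced here.

```python
# Toy retriever-reader pipeline: rank passages against the question, then let
# a "reader" pick the best answer sentence from the top-ranked passages.
def overlap(a: str, b: str) -> int:
    # Number of shared lowercase word types between two strings.
    return len(set(a.lower().split()) & set(b.lower().split()))

def retrieve(question: str, passages: list[str], k: int = 2) -> list[str]:
    # Stage 1: keep the k passages most similar to the question.
    return sorted(passages, key=lambda p: overlap(question, p), reverse=True)[:k]

def read(question: str, passage: str) -> tuple[str, int]:
    # Stage 2 (toy reader): return the passage sentence closest to the question.
    sentences = [s.strip() for s in passage.split(".") if s.strip()]
    best = max(sentences, key=lambda s: overlap(question, s))
    return best, overlap(question, best)

def answer(question: str, passages: list[str]) -> str:
    candidates = [read(question, p) for p in retrieve(question, passages)]
    return max(candidates, key=lambda c: c[1])[0]

passages = [
    "Hanoi is the capital of Vietnam. It lies on the Red River.",
    "Pho is a Vietnamese noodle soup.",
]
print(answer("What is the capital of Vietnam?", passages))
```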
An effective method for cross-lingual transfer is to fine-tune a bilingual or multilingual model on a supervised dataset in one language and evaluate it on another language in a zero-shot manner. Translating examples at training time or at inference time is also a viable alternative. However, these methods carry costs that are rarely addressed in the literature. In this work, we analyze cross-lingual methods in terms of their effectiveness (e.g., accuracy), their development and deployment costs, and their latency at inference time. Our experiments on three tasks indicate that the best cross-lingual method is highly task-dependent. Finally, by combining zero-shot and translation methods, we achieve state-of-the-art results on the three datasets used in this work. Based on these results, we question the need for manually labeled training data in a target language. Code and translated datasets are available at https://github.com/unicamp-dl/cross-lingual-analysis
Building effective open-domain question answering (open QA) systems for languages other than English can be challenging, mainly due to a lack of labeled training data. We propose a data-efficient method to bootstrap such a system for languages other than English. Our approach requires only limited QA resources in the given language, along with machine-translated data and at least one bilingual language model. To evaluate our approach, we build such a system for Icelandic and evaluate its performance on trivia-style datasets. The corpora used for training are English in origin, machine-translated into Icelandic. We train a bilingual Icelandic/English model to embed English contexts and Icelandic questions, following the method introduced with DensePhrases (Lee et al., 2021). The resulting system is an open-domain cross-lingual QA system between Icelandic and English. Finally, the system is adapted for Icelandic-only open QA, illustrating how open QA systems can be created efficiently with limited access to curated datasets in the language of interest.
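At query time, the phrase-retrieval step this system builds on (after Lee et al., 2021) reduces to maximum inner product search between a question embedding and precomputed phrase embeddings. Below is a minimal numpy sketch, with random vectors standing in for the paper's bilingual encoder outputs; the encoders themselves are out of scope here.

```python
# Maximum-inner-product phrase retrieval over a precomputed phrase index.
import numpy as np

def retrieve_phrases(question_vec: np.ndarray,
                     phrase_matrix: np.ndarray,
                     phrases: list[str],
                     top_k: int = 3) -> list[tuple[str, float]]:
    # Score every indexed phrase against the question by inner product,
    # then return the top_k phrases with their scores.
    scores = phrase_matrix @ question_vec
    best = np.argsort(-scores)[:top_k]
    return [(phrases[i], float(scores[i])) for i in best]

# Toy usage: random vectors stand in for real encoder outputs.
rng = np.random.default_rng(0)
phrases = ["Reykjavík", "Halldór Laxness", "1944"]
index = rng.normal(size=(len(phrases), 128))
question = rng.normal(size=128)
print(retrieve_phrases(question, index, phrases))
```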
In this work, we introduce IndicXTREME, a benchmark consisting of nine diverse tasks covering 18 languages from the Indic sub-continent belonging to four different families. Across languages and tasks, IndicXTREME contains a total of 103 evaluation sets, of which 51 are new contributions to the literature. To maintain high quality, we only use human annotators to curate or translate our datasets (for IndicXParaphrase, where an automatic translation system is used, a second human verification and correction step is done). To the best of our knowledge, this is the first effort toward creating a standard benchmark for Indic languages that aims to test the zero-shot capabilities of pretrained language models. We also release IndicCorp v2, an updated and much larger version of IndicCorp that contains 20.9 billion tokens in 24 languages. We pretrain IndicBERT v2 on IndicCorp v2 and evaluate it on IndicXTREME to show that it outperforms existing multilingual language models such as XLM-R and MuRIL.
Transformer-based architectures have achieved state-of-the-art results on many tasks, shifting practice from task-specific architectures to the fine-tuning of pre-trained language models. The ongoing trend is to train models with ever more data and parameters, which requires considerable resources and has led to an intense search for efficiency gains based on algorithmic and hardware improvements evaluated only for English. This raises questions about their usability when applied to small-scale learning problems, such as tasks in under-resourced languages for which only limited training data are available. The lack of appropriately sized corpora is an obstacle to applying data-driven and transfer learning-based approaches. In this paper, we build on recent efforts dedicated to the usability of transformer-based models and propose to evaluate these improvements on the question-answering performance of French, which has few resources. We address the instability related to data scarcity by investigating various training strategies involving data augmentation, hyperparameter optimization, and cross-lingual transfer. We also introduce FrALBERT, a new compact model for French, which proves competitive in low-resource settings.
The need for Question Answering datasets in low resource languages is the motivation of this research, leading to the development of the Kencorpus Swahili Question Answering Dataset, KenSwQuAD. This dataset is annotated from raw story texts in Swahili, a low resource language that is predominantly spoken in Eastern Africa and in other parts of the world. Question Answering (QA) datasets are important for machine comprehension of natural language in tasks such as internet search and dialog systems. Machine learning systems need training data such as the gold standard Question Answering set developed in this research. The research engaged annotators to formulate QA pairs from Swahili texts collected by the Kencorpus project, a Kenyan languages corpus. The project annotated 1,445 texts from the total of 2,585 texts with at least 5 QA pairs each, resulting in a final dataset of 7,526 QA pairs. A quality assurance set covering 12.5% of the annotated texts confirmed that the QA pairs were all correctly annotated. A proof of concept applying the set to the QA task confirmed that the dataset is usable for such tasks. KenSwQuAD has also contributed to the resourcing of the Swahili language.
The amount of information available on the internet has grown over the past decade. This digitization has created the need for automated question answering systems that can extract useful information from redundant, ever-changing knowledge sources. Such systems are designed to use natural language understanding (NLU) to serve the most relevant answer from these vast knowledge sources to the user's query, and thus depend heavily on the question answering (QA) field. Question answering involves, but is not limited to, steps such as mapping the user's question to a pertinent query, retrieving relevant information, and finding the best-suited answer from the retrieved information. Recent improvements in deep learning models have shown compelling performance gains on all of these tasks. In this review, research directions in the QA field are analyzed by question type, answer type, source of evidence and answers, and modeling approach. This is followed by open challenges in areas such as automatic question generation, similarity detection, and low resource availability for a language. Finally, a survey of available datasets and evaluation measures is presented.
Open-Domain Generative Question Answering has achieved impressive performance in English by combining document-level retrieval with answer generation. These approaches, which we refer to as GenQA, can generate complete sentences, effectively answering both factoid and non-factoid questions. In this paper, we extend GenQA to the multilingual and cross-lingual settings. For this purpose, we first introduce GenTyDiQA, an extension of the TyDiQA dataset with well-formed and complete answers for Arabic, Bengali, English, Japanese, and Russian. Based on GenTyDiQA, we design a cross-lingual generative model that produces full-sentence answers by exploiting passages written in multiple languages, including languages different from the question. Our cross-lingual generative system outperforms answer sentence selection baselines for all 5 languages and monolingual generative pipelines for three out of five languages studied.
We present TriviaQA, a challenging reading comprehension dataset containing over 650K question-answer-evidence triples. TriviaQA includes 95K question-answer pairs authored by trivia enthusiasts and independently gathered evidence documents, six per question on average, that provide high quality distant supervision for answering the questions. We show that, in comparison to other recently introduced large-scale datasets, TriviaQA (1) has relatively complex, compositional questions, (2) has considerable syntactic and lexical variability between questions and corresponding answer-evidence sentences, and (3) requires more cross sentence reasoning to find answers. We also present two baseline algorithms: a feature-based classifier and a state-of-the-art neural network that performs well on SQuAD reading comprehension. Neither approach comes close to human performance (23% and 40% vs. 80%), suggesting that TriviaQA is a challenging testbed that is worth significant future study.
We present the results of the Workshop on Multilingual Information Access (MIA) 2022 shared task, which evaluated cross-lingual open-retrieval question answering (QA) systems in 16 typologically diverse languages. In this task, we adapted two large-scale cross-lingual open-retrieval QA datasets covering 14 typologically diverse languages and newly annotated open-retrieval QA data in 2 underrepresented languages: Tagalog and Tamil. Four teams submitted systems. The best system, which leveraged iteratively mined diverse negative examples and larger pre-trained models, achieved 32.2 F1, outperforming our baseline by 4.5 points. The second-best system used entity-aware contextualized representations for document retrieval and achieved a significant improvement on Tamil (20.8 F1), whereas most of the other systems scored nearly zero on it.
To enable building and testing models for long-document comprehension, we introduce QuALITY, a multiple-choice QA dataset with context passages in English that average about 5,000 tokens, much longer than typical current models can process. Unlike prior work with passages, our questions are written and validated by contributors who have read the entire passage, rather than relying on summaries or excerpts. Moreover, only half of the questions are answerable by annotators working under tight time constraints, indicating that skimming and simple search are not enough to consistently perform well. Current models perform poorly on this task (55.4%) and lag far behind human performance (93.5%).
We present SpaceQA, to the best of our knowledge the first open-domain QA system in Space mission design. SpaceQA is part of an initiative by the European Space Agency (ESA) to facilitate the access, sharing and reuse of information about Space mission design within the agency and with the public. We adopt a state-of-the-art architecture consisting of a dense retriever and a neural reader, and opt for an approach based on transfer learning rather than fine-tuning due to the lack of domain-specific annotated data. Our evaluation on a test set produced by ESA is largely consistent with the results originally reported by the evaluated retrievers and confirms the need for fine-tuning for reading comprehension. As of writing this paper, ESA is piloting SpaceQA internally.
We present the Stanford Question Answering Dataset (SQuAD), a new reading comprehension dataset consisting of 100,000+ questions posed by crowdworkers on a set of Wikipedia articles, where the answer to each question is a segment of text from the corresponding reading passage. We analyze the dataset to understand the types of reasoning required to answer the questions, leaning heavily on dependency and constituency trees. We build a strong logistic regression model, which achieves an F1 score of 51.0%, a significant improvement over a simple baseline (20%). However, human performance (86.8%) is much higher, indicating that the dataset presents a good challenge problem for future research. The dataset is freely available at https://stanford-qa.com.
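For reference, the token-overlap F1 reported for SQuAD-style systems (the 51.0% above) can be computed as sketched below; this simplification omits the answer normalization and max-over-multiple-references steps that the official evaluation script also applies.

```python
# Token-overlap F1 between a predicted answer and a gold answer.
from collections import Counter

def f1_score(prediction: str, gold: str) -> float:
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    # Multiset intersection counts each shared token at most as often
    # as it appears in both strings.
    common = Counter(pred_tokens) & Counter(gold_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

print(f1_score("in the city of Boston", "Boston"))  # ~0.33
```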
Supervised Question Answering systems (QA systems) rely on domain-specific human-labeled data for training. Unsupervised QA systems generate their own question-answer training pairs, typically using secondary knowledge sources to achieve this outcome. Our approach (called PIE-QG) uses Open Information Extraction (OpenIE) to generate synthetic training questions from paraphrased passages and uses the question-answer pairs as training data for a language model for a state-of-the-art QA system based on BERT. Triples in the form of <subject, predicate, object> are extracted from each passage, and questions are formed with subjects (or objects) and predicates while objects (or subjects) are considered as answers. Experimenting on five extractive QA datasets demonstrates that our technique achieves on-par performance with existing state-of-the-art QA systems with the benefit of being trained on an order of magnitude fewer documents and without any recourse to external reference data sources.
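The triple-to-question step can be illustrated with a toy sketch; the wh-templates below are our own naive simplification for illustration, not PIE-QG's actual generation rules.

```python
# Turn <subject, predicate, object> triples into synthetic QA pairs:
# ask about the object with the subject as context, and vice versa.
from typing import NamedTuple

class Triple(NamedTuple):
    subject: str
    predicate: str
    object: str

def triple_to_qa(t: Triple) -> list[tuple[str, str]]:
    return [
        # Object becomes the answer (naive passive template).
        (f"What was {t.predicate} by {t.subject}?", t.object),
        # Subject becomes the answer.
        (f"Who or what {t.predicate} {t.object}?", t.subject),
    ]

for q, a in triple_to_qa(Triple("Marie Curie", "discovered", "polonium")):
    print(q, "->", a)
```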
Much recent progress in applications of machine learning models to NLP has been driven by benchmarks that evaluate models across a wide variety of tasks. However, these broad-coverage benchmarks have been mostly limited to English, and despite an increasing interest in multilingual models, a benchmark that enables the comprehensive evaluation of such methods on a diverse range of languages and tasks is still missing. To this end, we introduce the Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark, a multi-task benchmark for evaluating the cross-lingual generalization capabilities of multilingual representations across 40 languages and 9 tasks. We demonstrate that while models tested on English reach human performance on many tasks, there is still a sizable gap in the performance of cross-lingually transferred models, particularly on syntactic and sentence retrieval tasks. There is also a wide spread of results across languages. We release the benchmark to encourage research on cross-lingual learning methods that transfer linguistic knowledge across a diverse and representative set of languages and tasks.
Current studies in extractive question answering (EQA) have modeled the single-span extraction setting, where a single answer span is the label to predict for a given question-passage pair. This setting is natural for general-domain EQA, as most questions in the general domain can be answered with a single span. Following general-domain EQA models, current biomedical EQA (BioEQA) models adopt the single-span extraction setting with post-processing steps. In this paper, we investigate the question distribution across the general and biomedical domains and find that biomedical questions are more likely to require list-type answers (multiple answers) than factoid-type answers (a single answer). This necessitates models capable of producing multiple answers to a question. Based on this preliminary study, we propose a sequence-tagging approach for BioEQA, which is a multi-span extraction setting. Our approach directly handles questions with a variable number of phrases as answers and can learn from the training data to decide the number of answers for a question. Our experimental results on the BioASQ 7b and 8b list-type questions outperform the best-performing existing models without requiring post-processing steps. Source code and resources are freely available for download at https://github.com/dmis-lab/SeqTagQA
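The multi-span decoding that a sequence-tagging EQA model performs can be sketched as follows, assuming the common BIO tag convention (the paper's exact scheme may differ): contiguous B/I runs become separate answers, so the number of answers is determined by the tags themselves.

```python
# Decode a variable number of answer spans from per-token BIO tags.
def decode_spans(tokens: list[str], tags: list[str]) -> list[str]:
    spans, current = [], []
    for token, tag in zip(tokens, tags):
        if tag == "B":                   # a new answer span starts here
            if current:
                spans.append(" ".join(current))
            current = [token]
        elif tag == "I" and current:     # continue the open span
            current.append(token)
        else:                            # "O" closes any open span
            if current:
                spans.append(" ".join(current))
            current = []
    if current:
        spans.append(" ".join(current))
    return spans

tokens = "aspirin ibuprofen and naproxen reduce fever".split()
tags   = ["B", "B", "O", "B", "O", "O"]
print(decode_spans(tokens, tags))  # ['aspirin', 'ibuprofen', 'naproxen']
```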
Reading comprehension (RC) is the task of answering a question from a given passage or set of passages; in the case of multiple passages, the task is to find the best possible answer to the question. Recent trials and experiments in the field of natural language processing (NLP) have shown that machines can not only process the text in a passage and understand its meaning well enough to answer questions about it, but can also surpass human performance on many datasets, such as Stanford's Question Answering Dataset (SQuAD). This paper presents a study of reading comprehension and its evolution in natural language processing over the past few decades. We also examine how the task of single-document reading comprehension acts as a building block for our multi-document reading comprehension system. In the latter half of the paper, we study a recently proposed multi-document reading comprehension model, RE3QA, which is composed of a reader, a retriever, and a re-ranker-based network to fetch the best possible answer from a given set of passages.