Smart Paper Notes

AGReE: A system for generating Automated Grammar Reading Exercises

Sophia Chan, Swapna Somasundaran, Debanjan Ghosh, Mengxuan Zhao

Category: Natural Language Processing

2022-10-28

We describe the AGReE system, which takes user-submitted passages as input and automatically generates grammar practice exercises that can be completed while reading. Multiple-choice practice items are generated for a variety of grammar constructs: punctuation, articles, conjunctions, pronouns, prepositions, verbs, and nouns. We also conducted a large-scale human evaluation with around 4,500 multiple-choice practice items. For 95% of items, a majority of the five raters identified the correct answer, and for 85% of items, raters agreed that only one of the choices is correct. Finally, the error analysis shows that raters made the most mistakes on punctuation and conjunction items.


Linguistic Constructs as the Representation of the Domain Model in an Intelligent Language Tutoring System

Anisia Katinskaia, Jue Hou, Anh-Duc Vu, Roman Yangarber

Category: Natural Language Processing

2022-12-03

This paper presents the development of an AI-based language learning platform Revita. It is a freely available intelligent online tutor, developed to support learners of multiple languages, from low-intermediate to advanced levels. It has been in pilot use by hundreds of students at several universities, whose feedback and needs are shaping the development. One of the main emerging features of Revita is the introduction of a system of linguistic constructs as the representation of domain knowledge. The system of constructs is developed in close collaboration with experts in language teaching. Constructs define the types of exercises, the content of the feedback, and enable the detailed modeling and evaluation of learning progress.


QuALITY: Question Answering with Long Input Texts, Yes!

Richard Yuanzhe Pang, Alicia Parrish, Nitish Joshi, Nikita Nangia, Jason Phang, Angelica Chen, Vishakh Padmakumar, Johnny Ma, Jana Thompson, He He

Category: Natural Language Processing

2021-12-16

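To enable building and testing models for long-document comprehension, we introduce QuALITY, a multiple-choice QA dataset with context passages in English that average about 5,000 tokens, much longer than typical current models can process. Unlike prior work with passages, our questions are written and validated by contributors who have read the entire passage, rather than relying on summaries or excerpts. In addition, only half of the questions can be answered by annotators working under tight time constraints, indicating that skimming and simple search are not enough to perform well consistently. Current models perform poorly on this task (55.4%) and lag well behind human performance (93.5%).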


Teacher Perception of Automatically Extracted Grammar Concepts for L2 Language Learning

Aditi Chaudhary, Arun Sampath, Ashwin Sheshadri, Antonios Anastasopoulos, Graham Neubig

Category: Natural Language Processing

2022-06-10

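One of the challenges of language teaching is how to organize the rules of a language's grammar in a meaningful way. This requires not only pedagogical skill but also a deep understanding of the language. While comprehensive materials for developing such curricula are available for English and some widely spoken languages, for many other languages teachers need to create them manually to meet their students' needs. This process is challenging because i) it requires that such experts be accessible and have the necessary resources, and ii) even when such experts exist, describing all the intricacies of a language is time-consuming and prone to omission. In this paper, we present an automatic framework that aims to facilitate this process by automatically discovering and visualizing descriptions of different aspects of grammar. Specifically, we extract descriptions from a natural text corpus that answer questions about morphosyntax (learning word order, agreement, case marking, or word formation) and semantics (learning vocabulary), and we show illustrative examples. We apply this method to teaching two Indian languages, Kannada and Marathi, which, unlike English, do not have well-developed pedagogical resources and are therefore likely to benefit from this exercise. To assess the perceived utility of the extracted material, we enlist language educators from North American schools who teach these languages to perform a manual evaluation. Overall, teachers find the materials interesting as reference material for their own lesson preparation or even for learner evaluation.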


How much do language models copy from their training data? Evaluating linguistic novelty in text generation using RAVEN

R. Thomas McCoy, Paul Smolensky, Tal Linzen, Jianfeng Gao, Asli Celikyilmaz

Category: Natural Language Processing

2021-11-18

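Current language models can generate high-quality text. Are they simply copying text they have seen before, or have they learned generalizable linguistic abstractions? To tease apart these possibilities, we introduce RAVEN, a suite of analyses for assessing the novelty of generated text, focusing on sequential structure (n-grams) and syntactic structure. We apply these analyses to four neural language models (an LSTM, a Transformer, Transformer-XL, and GPT-2). For local structure (e.g., individual dependencies), model-generated text is substantially less novel than a baseline of human-generated text from each model's test set. For larger-scale structure (e.g., overall sentence structure), model-generated text is as novel as or even more novel than the human-generated baseline, but models still sometimes copy substantially, in some cases duplicating passages over 1,000 words long from the training set. We also perform extensive manual analysis showing that GPT-2's novel text is usually well-formed morphologically and syntactically but has reasonably frequent semantic issues (e.g., being self-contradictory).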
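The n-gram side of this kind of novelty analysis can be illustrated with a minimal sketch. It is not the RAVEN implementation: the function names, whitespace tokenization, and toy data below are assumptions made for illustration. It reports the fraction of n-grams in a generated text that never occur in the training text.

```python
from typing import List, Tuple


def ngrams(tokens: List[str], n: int) -> List[Tuple[str, ...]]:
    """Return every contiguous n-gram in the token list, in order."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def novel_ngram_fraction(generated: List[str], training: List[str], n: int) -> float:
    """Fraction of generated n-grams that do not appear in the training tokens.

    Hypothetical helper for illustration; returns 0.0 when the generated
    text is too short to contain any n-gram of the requested size.
    """
    seen = set(ngrams(training, n))
    candidates = ngrams(generated, n)
    if not candidates:
        return 0.0
    novel = sum(1 for gram in candidates if gram not in seen)
    return novel / len(candidates)


# Toy data (assumed): the generated sentence differs from training in one word.
training_tokens = "the cat sat on the mat".split()
generated_tokens = "the cat sat on the rug".split()

for size in (1, 2, 3):
    print(size, novel_ngram_fraction(generated_tokens, training_tokens, size))
```

Larger n typically yields a higher novelty fraction, which is why analyses of this kind report novelty across a range of n-gram sizes rather than at a single value.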


Crowdsourcing Relative Rankings of Multi-Word Expressions: Experts versus Non-Experts

David Alfter, Therese Lindström Tiedemann, Elena Volodina

Category: Natural Language Processing

2022-06-17

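In this study, we investigate the degree to which experts and non-experts agree on questions of difficulty in a crowdsourcing experiment. We ask non-experts (second-language learners of Swedish) and two groups of experts (teachers of Swedish as a second/foreign language and CEFR experts) to rank multi-word expressions in a crowdsourcing experiment. We find that the resulting rankings from all three tested groups correlate to a very high degree, which suggests that judgments produced in a comparative setting are not influenced by professional insight into Swedish as a second language.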


SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems

Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, Samuel R. Bowman

Category:

2019-05-02

In the last year, new models and methods for pretraining and transfer learning have driven striking performance improvements across a range of language understanding tasks. The GLUE benchmark, introduced a little over one year ago, offers a single-number metric that summarizes progress on a diverse set of such tasks, but performance on the benchmark has recently surpassed the level of non-expert humans, suggesting limited headroom for further research. In this paper we present SuperGLUE, a new benchmark styled after GLUE with a new set of more difficult language understanding tasks, a software toolkit, and a public leaderboard. SuperGLUE is available at super.gluebenchmark.com.


NL-Augmenter: A Framework for Task-Sensitive Natural Language Augmentation

Kaustubh D. Dhole, Varun Gangal, Sebastian Gehrmann, Aadesh Gupta, Zhenhao Li, Saad Mahamood, Abinaya Mahendiran, Simon Mille, Ashish Srivastava, Samson Tan

Category: Natural Language Processing | Artificial Intelligence | Machine Learning

2021-12-06

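Data augmentation is an important component of robustness evaluation for natural language processing (NLP) models, as well as of enhancing the diversity of the data they are trained on. In this paper, we present NL-Augmenter, a new participatory Python-based natural language augmentation framework that supports the creation of both transformations (modifications to the data) and filters (data splits according to specific features). We describe the framework and an initial set of 117 transformations and 23 filters for a variety of natural language tasks. We demonstrate the efficacy of NL-Augmenter by using several of its transformations to analyze the robustness of popular natural language models. The infrastructure, datacards, and robustness analysis results are publicly available in the NL-Augmenter repository (https://github.com/gem-benchmark/nl-augmenter).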


Learning to Reuse Distractors to support Multiple Choice Question Generation in Education

Semere Kiros Bitew, Amir Hadifar, Lucas Sterckx, Johannes Deleu, Chris Develder, Thomas Demeester

Category: Natural Language Processing

2022-10-25

Multiple choice questions (MCQs) are widely used in digital learning systems, as they allow for automating the assessment process. However, due to the increased digital literacy of students and the advent of social media platforms, MCQ tests are widely shared online, and teachers are continuously challenged to create new questions, which is an expensive and time-consuming task. A particularly sensitive aspect of MCQ creation is to devise relevant distractors, i.e., wrong answers that are not easily identifiable as being wrong. This paper studies how a large existing set of manually created answers and distractors for questions over a variety of domains, subjects, and languages can be leveraged to help teachers in creating new MCQs, by the smart reuse of existing distractors. We built several data-driven models based on context-aware question and distractor representations, and compared them with static feature-based models. The proposed models are evaluated with automated metrics and in a realistic user test with teachers. Both automatic and human evaluations indicate that context-aware models consistently outperform a static feature-based approach. For our best-performing context-aware model, on average 3 distractors out of the 10 shown to teachers were rated as high-quality distractors. We create a performance benchmark, and make it public, to enable comparison between different approaches and to introduce a more standardized evaluation of the task. The benchmark contains a test of 298 educational questions covering multiple subjects & languages and a 77k multilingual pool of distractor vocabulary for future research.


Grammatical Error Correction: A Survey of the State of the Art

Christopher Bryant, Zheng Yuan, Muhammad Reza Qorib, Hannan Cao, Hwee Tou Ng, Ted Briscoe

Category: Natural Language Processing | Artificial Intelligence

2022-11-09

Grammatical Error Correction (GEC) is the task of automatically detecting and correcting errors in text. The task not only includes the correction of grammatical errors, such as missing prepositions and mismatched subject-verb agreement, but also orthographic and semantic errors, such as misspellings and word choice errors respectively. The field has seen significant progress in the last decade, motivated in part by a series of five shared tasks, which drove the development of rule-based methods, statistical classifiers, statistical machine translation, and finally neural machine translation systems which represent the current dominant state of the art. In this survey paper, we condense the field into a single article and first outline some of the linguistic challenges of the task, introduce the most popular datasets that are available to researchers (for both English and other languages), and summarise the various methods and techniques that have been developed with a particular focus on artificial error generation. We next describe the many different approaches to evaluation as well as concerns surrounding metric reliability, especially in relation to subjective human judgements, before concluding with an overview of recent progress and suggestions for future work and remaining challenges. We hope that this survey will serve as a comprehensive resource for researchers who are new to the field or who want to be kept apprised of recent developments.


Automatic Generation of Programming Exercises and Code Explanations using Large Language Models

Sami Sarsa, Paul Denny, Arto Hellas, Juho Leinonen

Category: Artificial Intelligence | Natural Language Processing

2022-06-03

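This paper explores the natural language generation capabilities of large language models, applied to producing two types of learning resources common in programming courses. Using OpenAI Codex as the large language model, we create programming exercises (including sample solutions and test cases) and code explanations, and we assess them qualitatively and quantitatively. Our results suggest that the majority of the automatically generated content is both novel and sensible, and in some cases ready to use as is. When creating exercises, we find that it is remarkably easy to influence both the programming concepts and the contextual themes they contain simply by supplying keywords as input to the model. Our analysis suggests that large generative machine learning models have significant value as a tool for instructors, although some oversight is still needed to ensure the quality of the generated content before it is delivered to students. We further discuss the implications of OpenAI Codex and similar tools for introductory programming education and highlight future research streams that have the potential to improve the quality of the educational experience for both teachers and students.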


CoQA: A Conversational Question Answering Challenge

Siva Reddy, Danqi Chen, Christopher D. Manning

Category:

2018-08-21

Humans gather information through conversations involving a series of interconnected questions and answers. For machines to assist in information gathering, it is therefore essential to enable them to answer conversational questions. We introduce CoQA, a novel dataset for building Conversational Question Answering systems. Our dataset contains 127k questions with answers, obtained from 8k conversations about text passages from seven diverse domains. The questions are conversational, and the answers are free-form text with their corresponding evidence highlighted in the passage. We analyze CoQA in depth and show that conversational questions have challenging phenomena not present in existing reading comprehension datasets, e.g., coreference and pragmatic reasoning. We evaluate strong dialogue and reading comprehension models on CoQA. The best system obtains an F1 score of 65.4%, which is 23.4 points behind human performance (88.8%), indicating there is ample room for improvement. We present CoQA as a challenge to the community at https://stanfordnlp.github.io/coqa.


QuAC: Question Answering in Context

Eunsol Choi, He He, Mohit Iyyer, Mark Yatskar, Wen-tau Yih, Yejin Choi, Percy Liang, Luke Zettlemoyer

Category:

2018-08-21

We present QuAC, a dataset for Question Answering in Context that contains 14K information-seeking QA dialogs (100K questions in total). The dialogs involve two crowd workers: (1) a student who poses a sequence of freeform questions to learn as much as possible about a hidden Wikipedia text, and (2) a teacher who answers the questions by providing short excerpts from the text. QuAC introduces challenges not found in existing machine comprehension datasets: its questions are often more open-ended, unanswerable, or only meaningful within the dialog context, as we show in a detailed qualitative evaluation. We also report results for a number of reference models, including a recently state-of-the-art reading comprehension architecture extended to model dialog context. Our best model underperforms humans by 20 F1, suggesting that there is significant room for future work on this data. Dataset, baseline, and leaderboard available at http://quac.ai.


An AI-based Solution for Enhancing Delivery of Digital Learning for Future Teachers

Yong-Bin Kang, Abdur Rahim Mohammad Forkan, Prem Prakash Jayaraman, Natalie Wieland, Elizabeth Kollias, Hung Du, Steven Thomson, Yuan-Fang Li

Category: Artificial Intelligence

2021-11-09

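The recent and rapid shift to digital learning, hastened by the pandemic, has also been influenced by the ubiquitous availability of digital tools and platforms, which makes digital learning ever more accessible. An integral part of scaling digital learning and teaching, and one of the most difficult, is assessing learners' knowledge and competency. An educator can record a lecture or create digital content that can be delivered to thousands of learners, but assessing those learners is extremely time-consuming. In this paper, we propose an artificial intelligence (AI)-based solution, VidVersityQG, for automatically generating questions from pre-recorded video lectures. The solution can automatically generate different types of assessment questions (including short-answer, multiple-choice, true/false, and fill-in-the-blank questions) based on contextual and semantic information inferred from the videos. The proposed solution takes a human-centred approach, in which teachers are given the ability to modify or edit any AI-generated question. This approach encourages teachers to engage in the use and implementation of AI in education. The AI-based solution was evaluated for its accuracy in generating questions by experienced teaching professionals, using 117 education videos from multiple domains provided by our industry partner VidVersity. VidVersityQG shows promising results in automatically generating high-quality questions from video, significantly reducing the time and effort educators spend on manual question generation.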


Integrating Linguistic Theory and Neural Language Models

Bai Li

Category: Natural Language Processing

2022-07-20

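Transformer-based language models have recently achieved remarkable results in many natural language tasks. However, leaderboard performance is generally achieved by leveraging massive amounts of training data, and rarely by encoding explicit linguistic knowledge into neural models. This has led many to question the relevance of linguistics to modern natural language processing. In this dissertation, I present several case studies to illustrate how theoretical linguistics and neural language models are still relevant to each other. First, language models are useful to linguists because they provide an objective tool for measuring semantic distance, which is difficult to do using traditional methods. Conversely, linguistic theory contributes to language modelling research by providing frameworks and sources of data for probing language models on specific aspects of language understanding. This thesis contributes three studies that explore different aspects of the syntax-semantics interface in language models. In the first part, I apply language models to the problem of word class flexibility. Using mBERT as a source of semantic distance measurements, I present evidence in favour of analyzing word class flexibility as a directional process. In the second part, I propose a method to measure surprisal at intermediate layers of language models. My experiments show that sentences containing morphosyntactic anomalies trigger surprisal earlier in language models than semantic and commonsense anomalies do. Finally, in the third part, I adapt several psycholinguistic studies to show that language models contain knowledge of argument structure constructions. In summary, my thesis develops new connections between natural language processing, linguistic theory, and psycholinguistics to provide fresh perspectives on the interpretation of language models.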


ErAConD: Error Annotated Conversational Dialog Dataset for Grammatical Error Correction

Xun Yuan, Derek Pham, Sam Davidson, Zhou Yu

Category: Natural Language Processing

2021-12-15

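Currently available grammatical error correction (GEC) datasets are compiled from well-formed written text, which limits their applicability to other domains such as informal writing and dialog. In this paper, we present a novel parallel GEC dataset drawn from open-domain chatbot conversations; to our knowledge, it is the first GEC dataset targeted at a conversational setting. To demonstrate the utility of the dataset, we use the annotated data to fine-tune a state-of-the-art GEC model, resulting in a 16-point increase in model precision. This is of particular importance for a GEC model, because precision is considered more important than recall in GEC tasks: false positives can cause serious confusion for language learners. We also present a detailed annotation scheme that ranks errors by their perceived impact on comprehensibility, making our dataset both reproducible and extensible. Experimental results show the effectiveness of our data in improving GEC model performance.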


The Defeat of the Winograd Schema Challenge

Vid Kocijan, Ernest Davis, Thomas Lukasiewicz, Gary Marcus, Leora Morgenstern

Category: Natural Language Processing

2022-01-07

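The Winograd Schema Challenge, a set of twin sentences involving pronoun reference disambiguation that seem to require the use of commonsense knowledge, was proposed by Hector Levesque in 2011. By 2019, a number of AI systems, based on large pre-trained transformer-based language models and fine-tuned on these kinds of problems, achieved better than 90% accuracy. In this paper, we review the history of the Winograd Schema Challenge and assess its significance.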


Multitask Prompted Training Enables Zero-Shot Task Generalization

Victor Sanh, Albert Webson, Colin Raffel, Stephen H. Bach, Lintang Sutawika, Zaid Alyafeai, Antoine Chaffin, Arnaud Stiegler, Teven Le Scao, Arun Raja

Category: Machine Learning | Natural Language Processing

2021-10-15

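Large language models have recently been shown to attain reasonable zero-shot generalization on a diverse set of tasks (Brown et al., 2020). It has been hypothesized that this is a consequence of implicit multitask learning during language model pretraining (Radford et al., 2019). Can zero-shot generalization instead be directly induced by explicit multitask learning? To test this question at scale, we develop a system for easily mapping any natural language task into a human-readable prompted form. We convert a large set of supervised datasets, each with multiple prompts with diverse wording. These prompted datasets allow for benchmarking the ability of a model to perform completely unseen tasks. We fine-tune a pretrained encoder-decoder model (Raffel et al., 2020; Lester et al., 2021) on this multitask mixture covering a wide variety of tasks. The model attains strong zero-shot performance on several standard datasets, often outperforming models up to 16 times its size. Further, our approach attains strong performance on a subset of tasks from the BIG-bench benchmark, outperforming models up to 6 times its size. All prompts and trained models are available at https://github.com/bigscience-workshop/promptsource/ and https://huggingface.co/bigscience/t0pp.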


RACE: Large-scale ReAding Comprehension Dataset From Examinations

Guokun Lai, Qizhe Xie, Hanxiao Liu, Yiming Yang, Eduard Hovy

Category:

2017-04-15

We present RACE, a new dataset for benchmark evaluation of methods in the reading comprehension task. Collected from the English exams for middle and high school Chinese students aged 12 to 18, RACE consists of nearly 28,000 passages and nearly 100,000 questions generated by human experts (English instructors), and covers a variety of topics which are carefully designed for evaluating the students' ability in understanding and reasoning. In particular, the proportion of questions that require reasoning is much larger in RACE than in other benchmark datasets for reading comprehension, and there is a significant gap between the performance of the state-of-the-art models (43%) and the ceiling human performance (95%). We hope this new dataset can serve as a valuable resource for research and evaluation in machine comprehension. The dataset is freely available at http://www.cs.cmu.edu/~glai1/data/race/ and the code is available at https://github.com/qizhex/RACE_AR_baselines.


SQuAD: 100,000+ Questions for Machine Comprehension of Text

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, Percy Liang

Category:

2016-06-16

We present the Stanford Question Answering Dataset (SQuAD), a new reading comprehension dataset consisting of 100,000+ questions posed by crowdworkers on a set of Wikipedia articles, where the answer to each question is a segment of text from the corresponding reading passage. We analyze the dataset to understand the types of reasoning required to answer the questions, leaning heavily on dependency and constituency trees. We build a strong logistic regression model, which achieves an F1 score of 51.0%, a significant improvement over a simple baseline (20%). However, human performance (86.8%) is much higher, indicating that the dataset presents a good challenge problem for future research. The dataset is freely available at https://stanford-qa.com.
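The abstract reports F1 scores without defining them. For span-style answers, F1 is commonly computed over the tokens shared between a predicted answer and a reference answer; the sketch below assumes that token-overlap definition with simple lowercasing and whitespace tokenization. It is not the official evaluation script, and the function name and example strings are hypothetical.

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between a predicted and a reference answer string.

    Assumption: lowercased whitespace tokens; a real evaluation script may
    apply further normalization (for example, stripping punctuation).
    """
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        # Two empty answers count as a match; one empty answer scores zero.
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most as often as it
    # appears in both answers.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# Hypothetical example: the prediction contains one extra token.
print(token_f1("the Denver Broncos", "Denver Broncos"))
```

Under this definition a prediction that partially overlaps the reference span earns partial credit, unlike exact-match scoring.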
