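Data sparsity is a well-known problem for grammatical error correction (GEC). Generating synthetic training data is one widely proposed solution to this problem and has allowed models to achieve state-of-the-art (SOTA) performance in recent years. However, these methods often generate unrealistic errors, or aim to generate sentences with only one error. We propose a learning-based two-stage method for synthetic data generation for GEC that relaxes the constraint that a sentence contain only one error. Errors are generated in accordance with sentence merit. We show that a GEC model trained on our synthetically generated corpus outperforms models trained on synthetic data from prior work.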
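Due to the lack of parallel data for the grammatical error correction (GEC) task, models based on the sequence-to-sequence framework cannot be trained adequately enough to reach higher performance. We propose two data synthesis methods that can control the error rate and the ratio of error types in the synthetic data. The first method corrupts each word in a monolingual corpus with a fixed probability, using replacement, insertion, and deletion. The second method trains error generation models and then filters their decoding results. Experiments on different synthetic data show that an error rate of 40% with equal ratios of error types improves model performance the most. Finally, we synthesize about 100 million examples and achieve performance comparable to the state of the art, which uses twice as much data as we do.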
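The first method above (corrupting each word with a fixed probability) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `corrupt_sentence`, the uniform choice among the three operations, and drawing replacement and insertion words from a flat `vocab` list are all assumptions.

```python
import random


def corrupt_sentence(tokens, vocab, error_rate=0.4, rng=None):
    """Corrupt each token with probability `error_rate`.

    A corrupted token is replaced, followed by an inserted word, or deleted;
    the three operations are chosen with equal probability (an assumption).
    """
    rng = rng or random.Random()
    corrupted = []
    for token in tokens:
        if rng.random() >= error_rate:
            corrupted.append(token)  # keep the token unchanged
            continue
        operation = rng.choice(["replace", "insert", "delete"])
        if operation == "replace":
            corrupted.append(rng.choice(vocab))
        elif operation == "insert":
            corrupted.append(token)
            corrupted.append(rng.choice(vocab))
        # "delete": append nothing, which drops the token
    return corrupted


clean = "she goes to school every day".split()
noisy = corrupt_sentence(clean, vocab=["the", "go", "at", "in"], rng=random.Random(0))
print(" ".join(noisy), "->", " ".join(clean))
```

Each (noisy, clean) pair would then serve as one synthetic source/target training example. Passing a seeded `random.Random` keeps the corruption reproducible.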
In this paper, we present strong baselines for the task of Feedback Comment Generation for Writing Learning. Given a sentence and an error span, the task is to generate a feedback comment explaining the error. Sentences and feedback comments are both in English. We experiment with LLMs and also create multiple pseudo datasets for the task, investigating how they affect the performance of our system. We present our results for the task along with extensive analysis of the generated comments, with the aim of aiding future studies in feedback comment generation for English language learners.
Grammatical Error Correction (GEC) is the task of automatically detecting and correcting errors in text. The task not only includes the correction of grammatical errors, such as missing prepositions and mismatched subject-verb agreement, but also orthographic and semantic errors, such as misspellings and word choice errors respectively. The field has seen significant progress in the last decade, motivated in part by a series of five shared tasks, which drove the development of rule-based methods, statistical classifiers, statistical machine translation, and finally neural machine translation systems, which represent the current dominant state of the art. In this survey paper, we condense the field into a single article and first outline some of the linguistic challenges of the task, introduce the most popular datasets that are available to researchers (for both English and other languages), and summarise the various methods and techniques that have been developed with a particular focus on artificial error generation. We next describe the many different approaches to evaluation as well as concerns surrounding metric reliability, especially in relation to subjective human judgements, before concluding with an overview of recent progress and suggestions for future work and remaining challenges. We hope that this survey will serve as a comprehensive resource for researchers who are new to the field or who want to be kept apprised of recent developments.
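This paper presents a simple recipe for training state-of-the-art multilingual grammatical error correction (GEC) models. We achieve this by first proposing a language-agnostic method to generate a large number of synthetic examples. The second ingredient is to use large-scale multilingual language models (up to 11B parameters). Once fine-tuned on language-specific supervised sets, we surpass the previous state-of-the-art results on GEC benchmarks in four languages: English, Czech, German, and Russian. Having established a new set of baselines for GEC, we make our results easily reproducible and accessible by releasing the cLang-8 dataset. It is produced by using our best model, which we call gT5, to clean the targets of the widely used yet noisy Lang-8 dataset. cLang-8 greatly simplifies typical GEC training pipelines composed of multiple fine-tuning stages: we demonstrate that performing a single fine-tuning step on cLang-8 with off-the-shelf language models yields further accuracy improvements over the already top-performing gT5 model for English.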
This paper presents a solution to the GenChal 2022 shared task dedicated to feedback comment generation for writing learning. In this task, given a text with an error and the span of the error, a system generates an explanatory note that helps the writer (a language learner) improve their writing skills. Our solution is based on fine-tuning the T5 model on the initial dataset, augmented according to the syntactic dependencies of the words located within the indicated error span. The solution of our team "nigula" obtained second place according to manual evaluation by the organizers.
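Idiomatic expressions (IEs) play an important role in natural language. In this paper, we study the task of idiomatic sentence paraphrasing (ISP), which aims to paraphrase a sentence containing an IE by replacing the IE with its literal paraphrase. The lack of large-scale corpora of idiomatic-literal parallel sentences is a primary challenge for this task, for which we consider two separate solutions. First, we propose an unsupervised approach to ISP that leverages an IE's contextual information and definition and does not require a parallel sentence training set. Second, we propose a weakly supervised approach that uses back-translation to jointly perform paraphrasing and generation of sentences with IEs, in order to enlarge the small-scale parallel sentence training dataset. Other significant derivatives of the study include a model that replaces a literal phrase in a sentence with an IE to generate an idiomatic sentence, and a large-scale parallel dataset of idiomatic/literal sentence pairs. The effectiveness of the proposed solutions compared to competitive baselines is seen in relative gains of over 5.16 points in BLEU, over 8.75 points in METEOR, and over 19.57 points in SARI when the generated sentences are empirically validated on a parallel dataset using automatic and manual evaluations. We demonstrate the practical utility of ISP as a preprocessing step in En-De machine translation.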
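Inverse text normalization (ITN) is an essential post-processing step in automatic speech recognition (ASR). It converts numbers, dates, abbreviations, and other semiotic classes from the spoken form produced by ASR to their written forms. One can treat ITN as a machine translation task and solve it with neural sequence-to-sequence models. Unfortunately, such neural models are prone to hallucinations that can lead to unacceptable errors. To mitigate this issue, we propose a single-pass token classifier model that treats ITN as a tagging task. The model assigns a replacement fragment to every input token, or marks the token for deletion or for copying without changes. We present a dataset preparation method based on the granular alignment of ITN examples. The proposed model is less prone to hallucination errors. The model is trained on the Google Text Normalization dataset and achieves state-of-the-art sentence accuracy on both English and Russian test sets. The one-to-one correspondence between tags and input words improves the interpretability of the model's predictions, simplifies debugging, and allows for post-processing corrections. The model is simpler than sequence-to-sequence models and easier to optimize in production settings. The model and the code to prepare the dataset are published as part of the NeMo project.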
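To make the tagging formulation concrete, the sketch below shows how a sequence of per-token tags could be turned into written text. It covers only the final decoding step, with hand-written tags standing in for model predictions. The tag names `<KEEP>` and `<DELETE>` and the function `apply_tags` are hypothetical and are not taken from the NeMo code.

```python
KEEP = "<KEEP>"      # hypothetical tag: copy the input token unchanged
DELETE = "<DELETE>"  # hypothetical tag: drop the input token


def apply_tags(tokens, tags):
    """Build the written form from one tag per spoken-form token."""
    if len(tokens) != len(tags):
        raise ValueError("expected exactly one tag per input token")
    written = []
    for token, tag in zip(tokens, tags):
        if tag == KEEP:
            written.append(token)
        elif tag == DELETE:
            continue
        else:
            written.append(tag)  # any other tag is a replacement fragment
    return " ".join(written)


spoken = "on may third twenty twenty".split()
tags = [KEEP, "May", "3,", "2020", DELETE]  # hand-written, not model output
print(apply_tags(spoken, tags))  # prints "on May 3, 2020"
```

Because every output fragment is tied to exactly one input token, a wrong prediction can be traced back to the token that caused it, which is the interpretability benefit the abstract describes.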
With the recent advance in neural machine translation demonstrating its importance, research on quality estimation (QE) has been steadily progressing. QE aims to automatically predict the quality of machine translation (MT) output without reference sentences. Despite its high utility in the real world, there remain several limitations concerning manual QE data creation: inevitably incurred non-trivial costs due to the need for translation experts, and issues with data scaling and language expansion. To tackle these limitations, we present QUAK, a Korean-English synthetic QE dataset generated in a fully automatic manner. This consists of three sub-QUAK datasets QUAK-M, QUAK-P, and QUAK-H, produced through three strategies that are relatively free from language constraints. Since each strategy requires no human effort, which facilitates scalability, we scale our data up to 1.58M for QUAK-P, H and 6.58M for QUAK-M. As an experiment, we quantitatively analyze word-level QE results in various ways while performing statistical analysis. Moreover, we show that datasets scaled in an efficient way also contribute to performance improvements by observing meaningful performance gains in QUAK-M, P when adding data up to 1.58M.
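A central question in natural language understanding (NLU) research is whether high performance demonstrates that a model has strong reasoning capabilities. We present an extensive series of controlled experiments in which pre-trained language models are exposed to data that have undergone specific corruption transformations. The transformations involve removing instances of specific word classes and often lead to nonsensical sentences. Our results show that performance remains high on most GLUE tasks when the models are fine-tuned or tested on corrupted data, suggesting that the models leverage other cues for prediction even in nonsensical contexts. Our proposed data transformations can be used as a diagnostic tool for assessing the extent to which a specific dataset constitutes a proper testbed for evaluating models' language understanding capabilities.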
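A corruption of the kind described above, removing every instance of a word class, can be sketched in a few lines. The sketch assumes the sentence has already been part-of-speech tagged. The function name `remove_word_class` and the Universal Dependencies style tags in the example are illustrative choices, not details from the paper.

```python
def remove_word_class(tagged_tokens, removed_tags):
    """Drop every token whose part-of-speech tag is in `removed_tags`."""
    return [token for token, tag in tagged_tokens if tag not in removed_tags]


# Hand-tagged example sentence; a real pipeline would use a POS tagger.
sentence = [
    ("The", "DET"), ("cat", "NOUN"), ("chased", "VERB"),
    ("a", "DET"), ("mouse", "NOUN"), (".", "PUNCT"),
]
print(" ".join(remove_word_class(sentence, {"NOUN"})))  # prints "The chased a ."
```

Running the same model on the original and the corrupted inputs, and comparing task accuracy, gives the diagnostic signal the abstract refers to.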
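Post-editing in automatic speech recognition (ASR) entails automatically correcting common and systematic errors produced by the ASR system. The outputs of an ASR system are largely prone to phonetic and spelling errors. In this paper, we propose to use BART, a powerful pre-trained sequence-to-sequence model, further adaptively trained to serve as a denoising model, to correct errors of this type. The adaptive training is performed on an augmented dataset obtained by synthetically inducing errors as well as by incorporating actual errors from an existing ASR system. We also propose a simple approach to rescore the outputs using word-level alignments. Experimental results on accented speech data demonstrate that our strategy effectively rectifies a significant number of ASR errors and produces improved results when compared against a competitive baseline. We also highlight a negative result obtained on the related grammatical error correction task in Hindi, which shows the limits of our proposed model in capturing wider context.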
In the absence of readily available labeled data for a given task and language, annotation projection has been proposed as one of the possible strategies to automatically generate annotated data which may then be used to train supervised systems. Annotation projection has often been formulated as the task of projecting, on parallel corpora, some labels from a source into a target language. In this paper we present T-Projection, a new approach for annotation projection that leverages large pretrained text2text language models and state-of-the-art machine translation technology. T-Projection decomposes the label projection task into two subtasks: (i) The candidate generation step, in which a set of projection candidates using a multilingual T5 model is generated and, (ii) the candidate selection step, in which the candidates are ranked based on translation probabilities. We evaluate our method in three downstream tasks and five different languages. Our results show that T-projection improves the average F1 score of previous methods by more than 8 points.
Supervised Question Answering systems (QA systems) rely on domain-specific human-labeled data for training. Unsupervised QA systems generate their own question-answer training pairs, typically using secondary knowledge sources to achieve this outcome. Our approach (called PIE-QG) uses Open Information Extraction (OpenIE) to generate synthetic training questions from paraphrased passages and uses the question-answer pairs as training data for a language model for a state-of-the-art QA system based on BERT. Triples in the form of <subject, predicate, object> are extracted from each passage, and questions are formed with subjects (or objects) and predicates while objects (or subjects) are considered as answers. Experimenting on five extractive QA datasets demonstrates that our technique achieves on-par performance with existing state-of-the-art QA systems with the benefit of being trained on an order of magnitude fewer documents and without any recourse to external reference data sources.
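The triple-to-question step described above can be illustrated with a small sketch. The abstract does not specify the question templates, so the templates below, the `Triple` type, and the function `questions_from_triple` are assumptions made purely for illustration. In PIE-QG the triples come from an OpenIE system rather than being written by hand.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str


def questions_from_triple(triple, wh_word="what"):
    """Return two (question, answer) pairs built from one triple."""
    # Subject + predicate form the question; the object is the answer.
    object_question = f"{triple.subject} {triple.predicate} {wh_word}?"
    # Predicate + object form the question; the subject is the answer.
    subject_question = f"{wh_word.capitalize()} {triple.predicate} {triple.obj}?"
    return [(object_question, triple.obj), (subject_question, triple.subject)]


triple = Triple("The Amazon River", "flows through", "Brazil")
for question, answer in questions_from_triple(triple):
    print(question, "->", answer)
```

The resulting pairs, together with the passage the triple was extracted from, would form synthetic extractive QA training examples.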
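In this paper, we describe our shared task submissions for Subtask 2 of CASE-2022, Event Causality Identification with Causal News Corpus. The challenge focused on the automatic detection of all cause-effect-signal spans present in a sentence from news media. We detect cause-effect-signal spans in a sentence using T5, a pre-trained autoregressive language model. We iteratively identify all cause-effect-signal span triplets, always conditioning the prediction of the next triplet on the previously predicted ones. To predict the triplet itself, we consider different causal orderings, such as cause $\rightarrow$ effect $\rightarrow$ signal. Each triplet component is generated by a language model conditioned on the sentence, the previous parts of the current triplet, and the previously predicted triplets. Despite training on an extremely small dataset of 160 samples, our approach achieved competitive performance and placed second in the competition. Furthermore, we show that assuming either the cause $\rightarrow$ effect or the effect $\rightarrow$ cause order achieves similar results. Our code and model predictions will be released online.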
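We present the first parsing results on the Penn-Helsinki Parsed Corpus of Early Modern English (PPCEME), a 1.9-million-word treebank that is an important resource for research on syntactic change. We describe key features of PPCEME that make it challenging for parsing, including a larger and more varied set of function tags than in the Penn Treebank. We present results for this corpus using a modified version of the Berkeley Neural Parser and the approach to function tag recovery of Gabbard et al. (2006). Despite its simplicity, this approach works surprisingly well, suggesting that it is possible to recover the original structure with sufficient accuracy to support linguistic applications (e.g., searching for syntactic structures of interest). However, for a subset of function tags (e.g., the tag indicating direct speech), additional work is needed, and we discuss some further limitations of this approach. The resulting parser will be used to parse Early English Books Online, a 1.1-billion-word corpus whose utility for the study of syntactic change will be greatly increased by the addition of accurate parse trees.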
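Large pretrained language models have recently conquered the area of natural language processing. As an alternative to the predominant masked language modelling introduced in BERT, the T5 model introduced a more general training objective, namely sequence-to-sequence transformation, which includes masked language modelling but more naturally fits text generation tasks such as machine translation, summarization, open-domain question answering, text simplification, and dialogue systems. The monolingual variants of T5 models have been limited to well-resourced languages, while the massively multilingual T5 model supports 101 languages. In contrast, we trained two T5-type sequence-to-sequence models of different sizes for the morphologically rich Slovene language, which has far fewer resources, and analyzed their behavior. On classification tasks, the SloT5 models mostly lag behind the monolingual Slovene SloBERTa model, but they should be considered for generative tasks.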
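Automatic part-of-speech (POS) tagging is a preprocessing step for many natural language processing (NLP) tasks such as named entity recognition (NER), speech processing, information extraction, word sense disambiguation, and machine translation. It has already achieved promising results for English and European languages, but for Indian languages, and Odia in particular, it has not been well explored because of the lack of supporting tools and resources and the morphological richness of the language. Unfortunately, we were unable to find an open-source POS tagger for Odia, and only a few attempts have been made to develop one. The main contribution of this work is to present conditional random field (CRF) and deep learning-based approaches (CNN and bidirectional long short-term memory) for building an Odia part-of-speech tagger. We used a publicly accessible corpus whose dataset is annotated with the Bureau of Indian Standards (BIS) tagset. However, most languages around the globe use datasets annotated with the Universal Dependencies (UD) tagset. To maintain uniformity, the Odia dataset should use the same tagset, so we constructed a simple mapping from the BIS tagset to the UD tagset. We experimented with various feature sets as input to the CRF model and observed the impact of the constructed feature sets. The deep learning-based model includes a Bi-LSTM network, a CNN network, a CRF layer, character sequence information, and pre-trained word vectors. Character sequence information was extracted using a convolutional neural network (CNN) and a Bi-LSTM network. Six different combinations of neural sequence labelling models were implemented and their performance measures investigated. The Bi-LSTM model with character sequence features and pre-trained word vectors was observed to achieve a significant state-of-the-art result.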
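A tagset mapping of the kind mentioned above is essentially a lookup table. The sketch below is illustrative only: the entries in `BIS_TO_UD` are a plausible subset chosen for this example and are not the authors' actual mapping. The fallback to the UD tag `X` for unmapped tags is also an assumption.

```python
# Illustrative subset of a BIS -> UD mapping (assumed, not the paper's table).
BIS_TO_UD = {
    "N_NN": "NOUN",     # common noun
    "N_NNP": "PROPN",   # proper noun
    "PR_PRP": "PRON",   # personal pronoun
    "V_VM": "VERB",     # main verb
    "V_VAUX": "AUX",    # auxiliary verb
    "JJ": "ADJ",        # adjective
    "RB": "ADV",        # adverb
    "PSP": "ADP",       # postposition
    "CC_CCD": "CCONJ",  # coordinating conjunction
    "QT_QTC": "NUM",    # cardinal number
    "RD_PUNC": "PUNCT", # punctuation
}


def map_tags(tagged_tokens, fallback="X"):
    """Convert (token, BIS tag) pairs to (token, UD tag) pairs."""
    return [(token, BIS_TO_UD.get(tag, fallback)) for token, tag in tagged_tokens]


# Placeholder tokens stand in for Odia words.
print(map_tags([("w1", "N_NNP"), ("w2", "N_NN"), ("w3", "V_VM"), ("w4", "RD_PUNC")]))
```

A real mapping would need to cover every BIS tag in the corpus. Counting how often the fallback fires is a quick way to find tags that are still unmapped.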
Unavailability of parallel corpora for training text style transfer (TST) models is a very challenging yet common scenario. Also, TST models implicitly need to preserve the content while transforming a source sentence into the target style. To tackle these problems, an intermediate representation is often constructed that is devoid of style while still preserving the meaning of the source sentence. In this work, we study the usefulness of the Abstract Meaning Representation (AMR) graph as the intermediate style-agnostic representation. We posit that semantic notations like AMR are a natural choice for an intermediate representation. Hence, we propose T-STAR: a model comprising two components, a text-to-AMR encoder and an AMR-to-text decoder. We propose several modeling improvements to enhance the style agnosticity of the generated AMR. To the best of our knowledge, T-STAR is the first work that uses AMR as an intermediate representation for TST. With thorough experimental evaluation, we show that T-STAR significantly outperforms state-of-the-art techniques, achieving on average 15.2% higher content preservation with negligible loss (approximately 3%) in style accuracy. Through detailed human evaluation with 90,000 ratings, we also show that T-STAR has up to 50% fewer hallucinations compared to state-of-the-art TST models.
Damage to the inferior frontal gyrus (Broca's area) can cause agrammatic aphasia wherein patients, although able to comprehend, lack the ability to form complete sentences. This inability leads to communication gaps which cause difficulties in their daily lives. The usage of assistive devices can help in mitigating these issues and enable the patients to communicate effectively. However, due to lack of large scale studies of linguistic deficits in aphasia, research on such assistive technology is relatively limited. In this work, we present two contributions that aim to re-initiate research and development in this field. Firstly, we propose a model that uses linguistic features from small scale studies on aphasia patients and generates large scale datasets of synthetic aphasic utterances from grammatically correct datasets. We show that the mean length of utterance, the noun/verb ratio, and the simple/complex sentence ratio of our synthetic datasets correspond to the reported features of aphasic speech. Further, we demonstrate how the synthetic datasets may be utilized to develop assistive devices for aphasia patients. The pre-trained T5 transformer is fine-tuned using the generated dataset to suggest 5 corrected sentences given an aphasic utterance as input. We evaluate the efficacy of the T5 model using the BLEU and cosine semantic similarity scores. Affirming results with BLEU score of 0.827/1.00 and semantic similarity of 0.904/1.00 were obtained. These results provide a strong foundation for the concept that a synthetic dataset based on small scale studies on aphasia can be used to develop effective assistive technology.
Recent work attributes progress in NLP to large language models (LMs) with increased model size and large quantities of pretraining data. Despite this, current state-of-the-art LMs for Hebrew are both under-parameterized and under-trained compared to LMs in other languages. Additionally, previous work on pretrained Hebrew LMs focused on encoder-only models. While the encoder-only architecture is beneficial for classification tasks, it does not cater well to sub-word prediction tasks, such as Named Entity Recognition, when considering the morphologically rich nature of Hebrew. In this paper we argue that sequence-to-sequence generative architectures are more suitable for LLMs in the case of morphologically rich languages (MRLs) such as Hebrew. We demonstrate that by casting tasks in the Hebrew NLP pipeline as text-to-text tasks, we can leverage powerful multilingual, pretrained sequence-to-sequence models such as mT5, eliminating the need for a specialized, morpheme-based, separately fine-tuned decoder. Using this approach, our experiments show substantial improvements over previously published results on existing Hebrew NLP benchmarks. These results suggest that multilingual sequence-to-sequence models present a promising building block for NLP for MRLs.