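In this paper, we present an overview of CTC 2021, a Chinese text correction task for native speakers. We describe in detail the task definition and the data used for training and evaluation. We also summarize the approaches investigated by the participants of this task. We hope that the datasets collected and annotated for this task can facilitate and accelerate future development in this research area. To that end, the pseudo training data, the gold-standard validation data, and the entire leaderboard are publicly available online at https://destwang.github.io/ctc2021-explorer/.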
Translated by Google Translate
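A learner corpus collects language data produced by L2 learners, that is, learners of a second or foreign language. Such a resource is relevant to second language acquisition research, foreign language teaching, and automatic grammatical error correction. However, there has been little focus on learner corpora for learners of Chinese as a Foreign Language (CFL). We therefore propose to construct a large-scale, multidimensionally annotated Chinese learner corpus. To construct the corpus, we first obtain a large number of topic-rich texts generated by CFL learners. We then design an annotation scheme that includes a sentence acceptability score as well as grammatical error and fluency-based corrections. We build a crowdsourcing platform to perform the annotation effectively (https://yaclc.wenmind.net). We name the corpus YACLC (Yet Another Chinese Learner Corpus) and release it as part of the CUGE benchmark (http://cuge.baai.ac.cn). By analyzing the original sentences and the annotations in the corpus, we find that YACLC has a considerable size and very high annotation quality. We hope this corpus can further advance research on Chinese international education and Chinese automatic grammatical error correction.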
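With the rapid growth of mobile computing and web technologies, offensive language has become more prevalent on social networking platforms. Because offensive language identification in local languages is essential for moderating social media content, in this paper we work with three under-resourced Dravidian languages, namely Malayalam, Tamil, and Kannada. We present an evaluation task at FIRE 2020 (HASOC-DravidianCodeMix) and at DravidianLangTech at EACL 2021, designed to provide a framework for comparing different approaches to this problem. This paper describes the data creation, defines the task, lists the participating systems, and discusses the various methods.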
Grammatical Error Correction (GEC) is the task of automatically detecting and correcting errors in text. The task includes not only the correction of grammatical errors, such as missing prepositions and mismatched subject-verb agreement, but also orthographic and semantic errors, such as misspellings and word choice errors respectively. The field has seen significant progress in the last decade, motivated in part by a series of five shared tasks, which drove the development of rule-based methods, statistical classifiers, statistical machine translation, and finally neural machine translation systems, which represent the current dominant state of the art. In this survey paper, we condense the field into a single article and first outline some of the linguistic challenges of the task, introduce the most popular datasets that are available to researchers (for both English and other languages), and summarise the various methods and techniques that have been developed, with a particular focus on artificial error generation. We next describe the many different approaches to evaluation as well as concerns surrounding metric reliability, especially in relation to subjective human judgements, before concluding with an overview of recent progress and suggestions for future work and remaining challenges. We hope that this survey will serve as a comprehensive resource for researchers who are new to the field or who want to be kept apprised of recent developments.
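The Chinese Spelling Check (CSC) task aims to detect and correct Chinese spelling errors. In recent years, related research has focused on introducing "confusion sets" to strengthen how CSC models capture character similarity, while ignoring the context of characters, which carries richer information. To make better use of contextual similarity, we propose a simple yet effective curriculum learning framework for the CSC task. With the help of our model-agnostic framework, existing CSC models are trained the way humans learn Chinese characters and achieve further improvements. Extensive experiments and detailed analyses on the widely used SIGHAN datasets show that our method outperforms previous state-of-the-art methods.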
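Spelling error correction is one of the topics with a long history in natural language processing. Although previous studies have achieved remarkable results, challenges remain. For Vietnamese, the state-of-the-art method for the task infers the context of a syllable from its adjacent syllables. However, the accuracy of this method can be unsatisfactory, because the model may lose the context if two (or more) spelling errors stand next to each other. In this paper, we propose a novel method for correcting Vietnamese spelling errors. We tackle both mistyped errors and misspelled errors with a deep learning model. In particular, the embedding layer is powered by the byte pair encoding technique. The sequence-to-sequence model based on the Transformer architecture distinguishes our approach from previous approaches to the same problem. In the experiments, we train the model on a large synthetic dataset into which spelling errors are randomly introduced. We test the performance of the proposed method on a realistic dataset, which contains 11,202 human-made spelling errors in 9,341 different Vietnamese sentences. The experimental results show that our method achieves encouraging performance, with 86.8% of errors detected and 81.5% corrected, improving on the state-of-the-art approach by 5.6% and 2.2%, respectively.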
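The synthetic training data mentioned in the abstract above (clean sentences into which spelling errors are randomly introduced) can be illustrated with a minimal Python sketch. The function names, the toy alphabet, and the three edit operations are assumptions made for illustration; the abstract does not say how the authors generate their errors.

```python
import random

# Assumption: a toy Latin alphabet; a real Vietnamese setup would need
# diacritics and keyboard-specific confusions.
ALPHABET = "abcdefghijklmnopqrstuvwxyz"


def inject_spelling_errors(sentence, error_rate=0.1,
                           ops=("delete", "substitute", "insert"), rng=None):
    """Return a noisy copy of `sentence` with random character-level edits."""
    rng = rng or random.Random()
    noisy = []
    for ch in sentence:
        # Whitespace is never corrupted; other characters are corrupted
        # with probability `error_rate`.
        if ch.isspace() or rng.random() >= error_rate:
            noisy.append(ch)
            continue
        op = rng.choice(ops)
        if op == "delete":
            continue  # drop the character
        stray = rng.choice([c for c in ALPHABET if c != ch])
        if op == "substitute":
            noisy.append(stray)       # replace the character
        else:
            noisy.append(ch)          # insert: keep it and add a stray one
            noisy.append(stray)
    return "".join(noisy)


def make_training_pair(clean_sentence, error_rate=0.1, rng=None):
    """Build one (noisy source, clean target) pair for a seq2seq corrector."""
    noisy = inject_spelling_errors(clean_sentence, error_rate, rng=rng)
    return noisy, clean_sentence
```

Passing a seeded `random.Random` makes the corruption reproducible, which helps when regenerating the same synthetic corpus across runs.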
While pre-trained Chinese language models have demonstrated impressive performance on a wide range of NLP tasks, the Chinese Spell Checking (CSC) task remains a challenge. Previous research has explored using information such as glyphs and phonetics to improve the ability to distinguish misspelled characters, with good results. However, the generalization ability of these models is not well understood: it is unclear whether they incorporate glyph-phonetic information and, if so, whether this information is fully utilized. In this paper, we aim to better understand the role of glyph-phonetic information in the CSC task and suggest directions for improvement. Additionally, we propose a new, more challenging, and practical setting for testing the generalizability of CSC models. All code is made publicly available.
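User-generated content is full of misspellings. We hypothesize that the semantics of many misspellings are more than random noise, and that this hidden semantics can be leveraged for language understanding tasks. This paper presents an annotated corpus of misspellings in Thai, together with an analysis of misspelling intention and its possible semantics, to better understand the misspelling patterns observed in the corpus. In addition, we introduce two approaches for incorporating the semantics of misspellings: Misspelling Average Embedding (MAE) and Misspelling Semantic Tokens (MST). Experiments on a sentiment analysis task confirm our overall hypothesis: the additional semantics from misspellings can boost the micro F1 score by up to 0.4-2%, while blindly normalizing misspellings is harmful and suboptimal.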
In this paper, we present strong baselines for the task of Feedback Comment Generation for Writing Learning. Given a sentence and an error span, the task is to generate a feedback comment explaining the error. Sentences and feedback comments are both in English. We experiment with LLMs and also create multiple pseudo datasets for the task, investigating how it affects the performance of our system. We present our results for the task along with extensive analysis of the generated comments with the aim of aiding future studies in feedback comment generation for English language learners.
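The sensitivity of deep neural models to input noise is known to be a challenging problem. In NLP, model performance often deteriorates in the presence of naturally occurring noise, such as spelling errors. To mitigate this issue, models may leverage artificially noised data. However, the amount and type of generated noise have so far been determined arbitrarily. We therefore propose to model the errors statistically from grammatical error correction corpora. We present a thorough evaluation of several state-of-the-art NLP systems in multiple languages, with tasks including syntactic analysis, named entity recognition, neural machine translation, a subset of the GLUE benchmark, and reading comprehension. We also compare two approaches to addressing the performance drop: a) training the NLP models on noised data generated by our framework; and b) reducing the input noise with an external system for natural language correction. The code is released at https://github.com/ufal/kazitext.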
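Thanks to recent advances in natural language processing, several works have applied the pre-trained masked language model (MLM) of BERT to the post-correction of speech recognition. However, existing pre-trained models consider only semantic correction while ignoring the phonetic features of words. Semantic-only post-correction therefore degrades performance, since homophone errors are fairly common in Chinese ASR. In this paper, we propose a novel approach that jointly exploits contextualized representations and the phonetic information between an error and its replacement candidates to reduce the error rate of Chinese ASR. Our experimental results on real-world speech recognition datasets show that the proposed method achieves a notably lower CER than the baseline model, which uses a pre-trained BERT MLM as the corrector.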
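CER (character error rate), the metric named in the abstract above, is conventionally the character-level edit distance between the recognizer output and the reference transcript, divided by the reference length. The sketch below shows that standard definition; the function names are chosen here for illustration, and the abstract does not describe the authors' own scoring script.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance between two strings, computed row by row."""
    prev = list(range(len(hypothesis) + 1))
    for i, ref_ch in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_ch in enumerate(hypothesis, start=1):
            cost = 0 if ref_ch == hyp_ch else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate: edit distance normalised by reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)


# One homophone-style substitution in a six-character reference.
print(cer("今天天气很好", "今天天汽很好"))
```

Because the denominator is the reference length, a hypothesis with many insertions can score above 1.0.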
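Some grammatical error correction (GEC) systems incorporate hand-crafted rules and achieve positive results. However, defining rules manually is time-consuming and laborious. In view of this, we propose a method for mining error templates for GEC automatically. An error template is a regular expression that aims to identify text errors. We use a web crawler to acquire such error templates from the Internet. For each template, we further select the corresponding corrective action using language model perplexity as the criterion. Based on this method, we have accumulated 1,119 error templates for Chinese GEC. Experimental results on the newly proposed CTC-2021 Chinese GEC benchmark show that combining our error templates can effectively improve the performance of a strong GEC system, especially on two error types for which very little training data is available. Our error templates are available at https://github.com/hillzhang1999/gec_error_template.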
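The two ideas in the abstract above, an error template expressed as a regular expression and a corrective action chosen by language model perplexity, can be sketched as follows. Everything concrete here is an assumption for illustration: the `ErrorTemplate` class, the example template for the redundant pairing of 大约 and 左右, and the tiny add-one-smoothed character bigram model that stands in for a real language model. None of it is taken from the authors' released templates or code.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass


@dataclass
class ErrorTemplate:
    pattern: str         # regex that flags a suspicious span
    replacements: tuple  # candidate re.sub replacement strings


def train_char_bigram_lm(corpus):
    """Return a perplexity function backed by a toy character bigram LM."""
    contexts, bigrams = Counter(), Counter()
    for sentence in corpus:
        chars = ["<s>"] + list(sentence) + ["</s>"]
        contexts.update(chars[:-1])
        bigrams.update(zip(chars, chars[1:]))
    vocab_size = len(set(contexts) | {"</s>"})

    def perplexity(sentence):
        chars = ["<s>"] + list(sentence) + ["</s>"]
        log_prob = 0.0
        for a, b in zip(chars, chars[1:]):
            # Add-one smoothing keeps unseen bigrams at non-zero probability.
            log_prob += math.log((bigrams[(a, b)] + 1) / (contexts[a] + vocab_size))
        return math.exp(-log_prob / (len(chars) - 1))

    return perplexity


def correct_with_template(sentence, template, perplexity):
    """Apply every candidate action and keep the lowest-perplexity rewrite."""
    if not re.search(template.pattern, sentence):
        return sentence  # template does not fire; leave the text alone
    candidates = [re.sub(template.pattern, rep, sentence)
                  for rep in template.replacements]
    return min(candidates, key=perplexity)


# Hypothetical template: keep either 大约 or 左右, but not both.
redundancy = ErrorTemplate(r"大约(.+?)左右", (r"大约\1", r"\1左右"))
```

In practice the perplexity callable would wrap a much stronger language model; the selection logic in `correct_with_template` stays the same.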
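This paper presents an overview of the shared task on multilingual coreference resolution associated with the CRAC 2022 workshop. Shared task participants were asked to develop trainable systems capable of identifying mentions and clustering them according to identity coreference. The public edition of CorefUD 1.0, which contains 13 datasets for 10 languages, was used as the source of training and evaluation data. The CoNLL score used in previous coreference-oriented shared tasks served as the main evaluation metric. Five participating teams submitted eight coreference prediction systems; in addition, the organizers provided a competitive Transformer-based baseline system at the beginning of the shared task. The winning system outperformed the baseline by 12 percentage points (in terms of the score averaged across all datasets for all languages).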
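Currently available grammatical error correction (GEC) datasets are compiled from well-formed written text, which limits their applicability to other domains such as informal writing and dialog. In this paper, we present a novel parallel GEC dataset drawn from open-domain chatbot conversations; to our knowledge, it is the first GEC dataset targeted at a conversational setting. To demonstrate the utility of the dataset, we use our annotated data to fine-tune a state-of-the-art GEC model, which improves model precision by 16 points. This is particularly important for a GEC model, because precision is considered more important than recall in GEC tasks: false positives could seriously confuse language learners. We also present a detailed annotation scheme that ranks errors by their impact on reliability, making our dataset both reproducible and extensible. Experimental results show the effectiveness of our data in improving GEC model performance.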
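The preference for precision over recall stated in the abstract above is commonly encoded in GEC evaluation by reporting F0.5, which weights precision twice as heavily as recall. The abstract does not say which metric the authors report, so the sketch below is only an illustration of that general convention; the function name and the edit counts are made up for the example.

```python
def precision_recall_fbeta(true_pos, false_pos, false_neg, beta=0.5):
    """Return (precision, recall, F_beta) from edit-level counts."""
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    if precision == 0.0 and recall == 0.0:
        return precision, recall, 0.0
    beta_sq = beta * beta
    f_beta = (1 + beta_sq) * precision * recall / (beta_sq * precision + recall)
    return precision, recall, f_beta


# Two hypothetical systems with mirrored precision and recall.
cautious = precision_recall_fbeta(true_pos=8, false_pos=2, false_neg=8)
eager = precision_recall_fbeta(true_pos=8, false_pos=8, false_neg=2)
print(cautious[2], eager[2])
```

The two systems have identical F1, but the cautious one (fewer false positives) scores higher under F0.5, which matches the argument that spurious corrections are the costlier mistake for learners.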
We report the results of the first edition of the WMT shared task on Translation Suggestion (TS). The task aims to provide alternatives for specific words or phrases given entire documents generated by machine translation (MT). It consists of two sub-tasks: naive translation suggestion and translation suggestion with hints. The main difference is that some hints are provided in sub-task two, which makes it easier for the model to generate more accurate suggestions. For sub-task one, we provide corpora for the language pairs English-German and English-Chinese; for sub-task two, only the English-Chinese corpus is provided. We received 92 submissions from 5 participating teams in sub-task one and 6 submissions in sub-task two, most of them covering all of the translation directions. We used the automatic metric BLEU to evaluate the performance of each submission.
The lack of labeled data is one of the significant bottlenecks for Chinese Spelling Check (CSC). Existing research expands the supervised corpus by automatically generating data from unlabeled text. However, there is a big gap between real input scenarios and automatically generated corpora. Thus, we develop a competitive general speller, ECSpell, which adopts an Error Consistent masking strategy to create data for pretraining. This error-consistent masking strategy constrains the error types of automatically generated sentences to be consistent with those found in real scenarios. Experimental results indicate that our model outperforms previous state-of-the-art models on the general benchmark. Moreover, spellers often work within a particular domain in real life. Experiments on our domain-specific datasets show that general models perform poorly, owing to the many uncommon domain terms. Inspired by the common practice of input methods, we propose adding an alterable user dictionary to handle the zero-shot domain adaptation problem. Specifically, we attach a User Dictionary guided inference module (UD) to a general token-classification-based speller. Our experiments demonstrate that ECSpell$^{UD}$, namely ECSpell combined with UD, surpasses all other baselines by a large margin, even approaching its performance on the general benchmark.
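Much of the existing linguistic data in many of the world's languages is locked away in non-digitized books and documents. Optical character recognition (OCR) can be used to produce digitized text, and previous work has demonstrated the utility of neural post-correction methods that improve the results of general-purpose OCR systems on less-well-resourced languages. However, these methods rely on manually curated post-correction data, which is relatively scarce compared with the non-annotated raw images that need to be digitized. In this paper, we present a semi-supervised learning method that makes it possible to utilize these raw images to improve performance, specifically through self-training, a technique in which a model is iteratively trained on its own outputs. In addition, to enforce consistency in the recognized vocabulary, we introduce a lexically aware decoding method that augments the neural post-correction model with a count-based language model constructed from the recognized texts, implemented using weighted finite-state automata (WFSA) for efficient and effective decoding. Results on four endangered languages demonstrate the utility of the proposed method, with relative error reductions of 15-29%; we find that the combination of self-training and lexically aware decoding is essential for achieving consistent improvements. Data and code are available at https://shrutirij.github.io/ocr-el/.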
Spelling error correction is the task of identifying and rectifying misspelled words in texts. It is an active research topic in Natural Language Processing because of its numerous applications in human language understanding. Phonetically or visually similar yet semantically distinct characters make it an arduous task in any language. Earlier efforts on spelling error correction in Bangla and resource-scarce Indic languages focused on rule-based, statistical, and machine learning-based methods, which we found rather inefficient. In particular, machine learning-based approaches, which exhibit superior performance to rule-based and statistical methods, are ineffective because they correct each character regardless of its appropriateness. In this work, we propose a novel detector-purificator-corrector framework based on denoising transformers that addresses these issues. Moreover, we present a method for large-scale corpus creation from scratch, which in turn resolves the resource limitation problem of any left-to-right scripted language. The empirical outcomes demonstrate the effectiveness of our approach, which outperforms previous state-of-the-art methods by a significant margin for Bangla spelling error correction. The models and corpus are publicly available at https://tinyurl.com/DPCSpell.
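The widespread presence of offensive content such as hate speech poses a growing societal problem. AI tools are necessary to support the moderation process on online platforms. Evaluating these identification tools requires continuous experimentation with datasets in different languages. The HASOC track (Hate Speech and Offensive Content Identification) is dedicated to developing benchmark data for this purpose. This paper presents the HASOC subtrack for English, Hindi, and Marathi. The dataset was assembled from Twitter. This subtrack has two sub-tasks. Task A is a binary classification problem (hate and not offensive) offered for all three languages. Task B is a fine-grained classification problem with three classes (hate speech, offensive, and profanity) offered for English and Hindi. Overall, 652 runs were submitted. The best classification algorithms for Task A achieved scores of 0.91, 0.78, and 0.83 for Marathi, Hindi, and English, respectively. This overview presents the tasks and the data development as well as the detailed results. The systems submitted to the competition applied a variety of technologies. The best-performing algorithms were mainly variants of Transformer architectures.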
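The SIGMORPHON 2022 shared task on morpheme segmentation challenged systems to decompose a word into a sequence of morphemes and covered most types of morphology: compounds, derivations, and inflections. Subtask 1, word-level morpheme segmentation, covered 5 million words in 9 languages (Czech, English, Spanish, Hungarian, French, Italian, Russian, Latin, Mongolian) and received 13 system submissions from 7 teams, with the best system averaging a 97.29% F1 score across all languages, ranging from English (93.84%) to Latin (99.38%). Subtask 2, sentence-level morpheme segmentation, covered 18,735 sentences in 3 languages (Czech, English, Mongolian) and received 10 system submissions from 3 teams, with the best systems outperforming all three state-of-the-art subword tokenization methods (BPE, ULM, Morfessor2) by 30.71% absolute. To facilitate error analysis and support any type of future study, we released all system predictions, the evaluation script, and all gold-standard datasets.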