Spelling error correction is one of the topics with a long history in natural language processing. Although previous studies have achieved remarkable results, challenges remain. In Vietnamese, the state-of-the-art method for this task infers a syllable's context from its adjacent syllables. However, that method's accuracy can be unsatisfactory, because the model may lose the context if two (or more) spelling mistakes stand next to each other. In this paper, we propose a novel method for correcting Vietnamese spelling errors. We tackle both mistyped and misspelled errors with a deep learning model. In particular, the embedding layer is powered by the byte pair encoding technique. A sequence-to-sequence model based on the Transformer architecture makes our approach different from previous work on the same problem. In the experiments, we train the model on a large synthetic dataset into which spelling errors are randomly introduced, and we test the performance of the proposed method on a realistic dataset containing 11,202 human-made misspellings in 9,341 different Vietnamese sentences. The experimental results show that our method achieves encouraging performance, detecting 86.8% of errors and correcting 81.5% of them, which improves on the state-of-the-art approach by 5.6% and 2.2%, respectively.
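The abstract above mentions training on synthetic data with randomly introduced spelling errors. As a rough illustration of that data-generation step, here is a minimal sketch of error injection; the confusion pairs are illustrative stand-ins for common Vietnamese typing slips, since the paper's actual noise model is not described in the abstract:

```python
import random

# Illustrative confusion pairs (assumed, not from the paper): onsets that
# are frequently confused in Vietnamese writing.
CONFUSION = {"s": "x", "x": "s", "ch": "tr", "tr": "ch", "d": "gi", "gi": "d"}

def inject_errors(sentence, p=0.15, rng=random):
    """Corrupt each syllable with probability p to build (noisy, clean) pairs."""
    noisy = []
    for syllable in sentence.split():
        if rng.random() < p:
            for src, tgt in CONFUSION.items():
                if syllable.startswith(src):
                    syllable = tgt + syllable[len(src):]
                    break
        noisy.append(syllable)
    return " ".join(noisy)

clean = "xin chào các bạn"
print(inject_errors(clean, p=0.5), "->", clean)  # one synthetic training pair
```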
Spelling error correction is the task of identifying and rectifying misspelled words in texts. It is a potential and active research topic in Natural Language Processing because of numerous applications in human language understanding. The phonetically or visually similar yet semantically distinct characters make it an arduous task in any language. Earlier efforts on spelling error correction in Bangla and resource-scarce Indic languages focused on rule-based, statistical, and machine learning-based methods which we found rather inefficient. In particular, machine learning-based approaches, which exhibit superior performance to rule-based and statistical methods, are ineffective as they correct each character regardless of its appropriateness. In this work, we propose a novel detector-purificator-corrector framework based on denoising transformers by addressing previous issues. Moreover, we present a method for large-scale corpus creation from scratch which in turn resolves the resource limitation problem of any left-to-right scripted language. The empirical outcomes demonstrate the effectiveness of our approach that outperforms previous state-of-the-art methods by a significant margin for Bangla spelling error correction. The models and corpus are publicly available at https://tinyurl.com/DPCSpell.
The lack of labeled data is one of the significant bottlenecks for Chinese Spelling Check (CSC). Existing research expands the supervised corpus by generating data automatically from unlabeled text. However, there is a big gap between such automatically generated corpora and real input scenarios. We therefore develop a competitive general speller, ECSpell, which adopts an error-consistent masking strategy to create data for pretraining. This strategy specifies the error types of automatically generated sentences so that they are consistent with real scenes. The experimental results indicate that our model outperforms previous state-of-the-art models on the general benchmark. Moreover, spellers often work within a particular domain in real life, and experiments on the domain-specific datasets we built show that general models perform terribly there because of the many uncommon domain terms. Inspired by the common practice of input methods, we propose to add an alterable user dictionary to handle this zero-shot domain adaptation problem. Specifically, we attach a User Dictionary guided inference module (UD) to a general token-classification-based speller. Our experiments demonstrate that ECSpell$^{UD}$, namely ECSpell combined with UD, surpasses all the other baselines by a large margin, even approaching the performance on the general benchmark.
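Since the abstract describes the UD module only at a high level, the following is a hedged sketch of one way dictionary guidance could work at inference time: candidate characters whose local context completes a term in the user dictionary receive a score bonus. The function names and the bonus weight are illustrative assumptions, not the paper's mechanism:

```python
# Hypothetical dictionary-guided rescoring for one suspicious position.
def ud_guided_pick(candidates, left, right, user_dict, boost=2.0):
    """candidates: list of (char, speller_score); left/right: context strings."""
    def score(cand):
        char, s = cand
        window = left + char + right
        # Reward candidates that complete a known domain term.
        return s + boost if any(term in window for term in user_dict) else s
    return max(candidates, key=score)[0]

user_dict = {"质谱仪"}  # hypothetical domain term: "mass spectrometer"
print(ud_guided_pick([("普", 0.9), ("谱", 0.7)], left="质", right="仪",
                     user_dict=user_dict))  # -> "谱"
```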
In this paper, we investigate the application of end-to-end and multi-module frameworks for grapheme-to-phoneme (G2P) conversion in Persian. The results show that our proposed multi-module G2P system outperforms our end-to-end systems in terms of both accuracy and speed. The system consists of a pronunciation dictionary as a look-up table, together with separate models for homographs, OOV words, and ezafe in Persian, built using GRU and Transformer architectures. The system operates at the sequence level rather than the word level, which enables it to effectively capture the unwritten relations between words (cross-word information) needed for homograph disambiguation and ezafe recognition without any preprocessing. After evaluation, our system reached 94.48% word-level accuracy, outperforming previous G2P systems for Persian.
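A minimal sketch of the dictionary-first control flow described above: in-vocabulary words are read from the pronunciation lexicon, and everything else falls back to a learned model (stubbed here). The lexicon entries and romanization are placeholders; the paper's homograph and ezafe handling is sequence-level and considerably richer:

```python
# Placeholder lexicon mapping Persian orthography to phoneme strings.
LEXICON = {"کتاب": "ketAb", "خانه": "xAne"}

def g2p(sentence, oov_model=lambda w: "<oov:%s>" % w):
    """Look each word up; defer OOVs to a (stubbed) neural G2P model."""
    return [LEXICON.get(word) or oov_model(word) for word in sentence.split()]

print(g2p("کتاب خانه جدید"))  # -> ['ketAb', 'xAne', '<oov:جدید>']
```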
Persian is an inflectional SOV language, a fact that makes it a more uncertain language. However, techniques such as zero-width non-joiner (ZWNJ) recognition, punctuation restoration, and Persian ezafe construction lead us to more understandable and precise text. In most Persian work, these techniques are addressed individually; nevertheless, we believe that all of these tasks are necessary for text refinement in Persian. In this work, we propose the ViraPart framework, which uses an embedded ParsBERT at its core for text clarification. First, the BERT variant is used together with a classifier layer for the classification procedure. Next, we combine the model outputs to produce the cleared text. Finally, the proposed models for ZWNJ recognition, punctuation restoration, and Persian ezafe construction achieve average macro F1 scores of 96.90%, 92.13%, and 98.50%, respectively. Experimental results show that our proposed approach is very effective for text refinement in the Persian language.
Ensuring proper punctuation and letter casing is a key preprocessing step before applying complex natural language processing algorithms. This is especially significant for text sources where punctuation and casing are missing, such as the raw output of automatic speech recognition systems. In addition, short text messages and micro-blogging platforms offer unreliable and often erroneous punctuation and casing. This survey gives an overview of historical and state-of-the-art techniques for restoring punctuation and correcting word casing. Furthermore, current challenges and research directions are highlighted.
Grammatical Error Correction (GEC) is the task of automatically detecting and correcting errors in text. The task not only includes the correction of grammatical errors, such as missing prepositions and mismatched subject-verb agreement, but also orthographic and semantic errors, such as misspellings and word choice errors respectively. The field has seen significant progress in the last decade, motivated in part by a series of five shared tasks, which drove the development of rule-based methods, statistical classifiers, statistical machine translation, and finally neural machine translation systems which represent the current dominant state of the art. In this survey paper, we condense the field into a single article and first outline some of the linguistic challenges of the task, introduce the most popular datasets that are available to researchers (for both English and other languages), and summarise the various methods and techniques that have been developed with a particular focus on artificial error generation. We next describe the many different approaches to evaluation as well as concerns surrounding metric reliability, especially in relation to subjective human judgements, before concluding with an overview of recent progress and suggestions for future work and remaining challenges. We hope that this survey will serve as a comprehensive resource for researchers who are new to the field or who want to be kept apprised of recent developments.
Neural machine translation (NMT) models typically operate with a fixed vocabulary, but translation is an open-vocabulary problem. Previous work addresses the translation of out-of-vocabulary words by backing off to a dictionary. In this paper, we introduce a simpler and more effective approach, making the NMT model capable of open-vocabulary translation by encoding rare and unknown words as sequences of subword units. This is based on the intuition that various word classes are translatable via smaller units than words, for instance names (via character copying or transliteration), compounds (via compositional translation), and cognates and loanwords (via phonological and morphological transformations). We discuss the suitability of different word segmentation techniques, including simple character n-gram models and a segmentation based on the byte pair encoding compression algorithm, and empirically show that subword models improve over a back-off dictionary baseline for the WMT 15 translation tasks English→German and English→Russian by up to 1.1 and 1.3 BLEU, respectively.
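The core BPE learning loop is small enough to show in full; the sketch below closely follows the minimal Python version published with the paper: repeatedly count adjacent symbol pairs over a frequency-weighted vocabulary and merge the most frequent pair into a new symbol:

```python
import re
import collections

def get_stats(vocab):
    """Count frequencies of adjacent symbol pairs in the vocabulary."""
    pairs = collections.defaultdict(int)
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pairs[symbols[i], symbols[i + 1]] += freq
    return pairs

def merge_vocab(pair, v_in):
    """Merge the chosen pair into a single symbol everywhere it occurs."""
    v_out = {}
    bigram = re.escape(" ".join(pair))
    pattern = re.compile(r"(?<!\S)" + bigram + r"(?!\S)")
    for word in v_in:
        v_out[pattern.sub("".join(pair), word)] = v_in[word]
    return v_out

# Toy corpus: words pre-split into characters, with end-of-word marker </w>.
vocab = {"l o w </w>": 5, "l o w e r </w>": 2,
         "n e w e s t </w>": 6, "w i d e s t </w>": 3}
for _ in range(10):
    pairs = get_stats(vocab)
    best = max(pairs, key=pairs.get)   # most frequent adjacent pair
    vocab = merge_vocab(best, vocab)
    print(best)                        # the learned merge operations, in order
```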
Owing to recent advances in natural language processing, several works have applied BERT's pretrained masked language model (MLM) to post-correction for speech recognition. However, existing pretrained models consider only semantic correction while ignoring the phonetic features of words. Purely semantic post-correction therefore degrades performance, since homophone errors are fairly common in Chinese ASR. In this paper, we propose a novel approach that collectively exploits contextualized representations and the phonetic information between an error and its replacement candidates to alleviate the error rate of Chinese ASR. Our experimental results on real-world speech recognition datasets show that the proposed method clearly achieves a lower CER than the baseline model that uses a pretrained BERT MLM as the corrector.
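As a hedged sketch of the scoring idea (not the paper's exact formulation), replacement candidates can be ranked by interpolating a masked-LM probability with a phonetic similarity to the observed character; the pinyin strings, similarity measure, and interpolation weight below are all illustrative assumptions:

```python
from difflib import SequenceMatcher

def phonetic_sim(pinyin_a, pinyin_b):
    """A crude string-overlap proxy for phonetic similarity."""
    return SequenceMatcher(None, pinyin_a, pinyin_b).ratio()

def rank_candidates(mlm_probs, pinyin_of, observed, alpha=0.6):
    """mlm_probs: {candidate_char: P(char | context)} from a masked LM."""
    def score(c):
        return (alpha * mlm_probs[c]
                + (1 - alpha) * phonetic_sim(pinyin_of[c], pinyin_of[observed]))
    return sorted(mlm_probs, key=score, reverse=True)

mlm_probs = {"事": 0.40, "是": 0.35, "试": 0.05}   # toy MLM probabilities
pinyin_of = {"事": "shi4", "是": "shi4", "试": "shi4", "时": "shi2"}
print(rank_candidates(mlm_probs, pinyin_of, observed="时"))
```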
Error correction in automatic speech recognition (ASR) aims to correct those incorrect words in sentences generated by ASR models. Since recent ASR models usually have low word error rate (WER), to avoid affecting originally correct tokens, error correction models should only modify incorrect words, and therefore detecting incorrect words is important for error correction. Previous works on error correction either implicitly detect error words through target-source attention or CTC (connectionist temporal classification) loss, or explicitly locate specific deletion/substitution/insertion errors. However, implicit error detection does not provide a clear signal about which tokens are incorrect and explicit error detection suffers from low detection accuracy. In this paper, we propose SoftCorrect with a soft error detection mechanism to avoid the limitations of both explicit and implicit error detection. Specifically, we first detect whether a token is correct or not through a probability produced by a dedicatedly designed language model, and then design a constrained CTC loss that only duplicates the detected incorrect tokens to let the decoder focus on the correction of error tokens. Compared with implicit error detection with CTC loss, SoftCorrect provides an explicit signal about which words are incorrect and thus does not need to duplicate every token but only incorrect tokens; compared with explicit error detection, SoftCorrect does not detect specific deletion/substitution/insertion errors but just leaves it to CTC loss. Experiments on AISHELL-1 and Aidatatang datasets show that SoftCorrect achieves 26.1% and 9.4% CER reduction respectively, outperforming previous works by a large margin, while still enjoying fast speed of parallel generation.
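A toy sketch of the detection-then-duplication idea from the abstract: a token whose LM probability falls below a threshold is treated as incorrect, and only that token is duplicated to give the decoder room to rewrite it. The threshold and duplication factor are assumptions, and the constrained CTC loss itself is not reproduced here:

```python
def soft_detect_and_expand(tokens, token_probs, tau=0.5, dup=2):
    """tokens: ASR output; token_probs: per-token correctness probabilities."""
    expanded = []
    for tok, p in zip(tokens, token_probs):
        if p < tau:                      # likely incorrect: give decoder room
            expanded.extend([tok] * dup)
        else:                            # likely correct: keep as-is
            expanded.append(tok)
    return expanded

print(soft_detect_and_expand(["我", "哎", "北", "京"], [0.98, 0.12, 0.95, 0.97]))
# -> ['我', '哎', '哎', '北', '京']: only the suspicious token is duplicated
```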
Celebrity endorsement is one of the most important strategies in brand communication. Nowadays, more and more companies try to build a vivid characteristic for themselves, so their brand identity communications should accord with certain characteristics of human personality and regulations. However, previous works mostly stop at assumptions rather than proposing a specific way to perform matching between brands and celebrities. In this paper, we propose a brand celebrity matching model (BCM) based on natural language processing (NLP) techniques. Given a brand and a celebrity, we first obtain descriptive documents about them from the Internet, then summarize these documents, and finally calculate a matching degree between the brand and the celebrity to determine whether they match. According to the experimental results, our proposed model outperforms the best baselines by a 0.362 F1 score and 6.3% in accuracy, which indicates the effectiveness and application value of our model in the real world. Moreover, to the best of our knowledge, the proposed BCM model is the first work to use NLP to solve the endorsement problem, so it can provide some novel research ideas and methodologies for subsequent work.
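The final step of the pipeline, computing a matching degree, is unspecified in the abstract; one simple realization, assuming the document summaries have been embedded by any sentence encoder, is a thresholded cosine similarity:

```python
import numpy as np

def matching_degree(brand_vec, celeb_vec):
    """Cosine similarity between summary embeddings (an assumed scorer)."""
    return float(brand_vec @ celeb_vec /
                 (np.linalg.norm(brand_vec) * np.linalg.norm(celeb_vec)))

rng = np.random.default_rng(1)
brand, celeb = rng.standard_normal(16), rng.standard_normal(16)
print(matching_degree(brand, celeb) >= 0.5)  # hypothetical decision threshold
```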
State-of-the-art text simplification (TS) systems adopt end-to-end neural network models to directly generate the simplified version of the input text, and usually function as a black box. Moreover, TS is usually treated as an all-purpose generic task under the assumption of homogeneity, where the same simplification is suitable for all. In recent years, however, there has been increasing recognition of the need to adapt the simplification techniques to the specific needs of different target groups. In this work, we aim to advance current research on explainable and controllable TS in two ways: First, building on recently proposed work to increase the transparency of TS systems, we use a large set of (psycho-)linguistic features in combination with pre-trained language models to improve explainable complexity prediction. Second, based on the results of this preliminary task, we extend a state-of-the-art Seq2Seq TS model, ACCESS, to enable explicit control of ten attributes. The results of experiments show (1) that our approach improves the performance of state-of-the-art models for predicting explainable complexity and (2) that explicitly conditioning the Seq2Seq model on ten attributes leads to a significant improvement in performance in both within-domain and out-of-domain settings.
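ACCESS conditions the Seq2Seq model by prepending discretized control tokens to the source sentence; a sketch of that mechanism is below. The two attribute names shown come from the original ACCESS paper, while the bucket width is an assumption and the ten attributes this work adds are not enumerated in the abstract:

```python
def add_control_tokens(source, attributes, bucket=0.05):
    """Prepend <Name_value> control tokens with values snapped to buckets."""
    toks = []
    for name, value in sorted(attributes.items()):
        v = round(round(value / bucket) * bucket, 2)  # discretize the ratio
        toks.append(f"<{name}_{v}>")
    return " ".join(toks) + " " + source

print(add_control_tokens(
    "The committee adjourned the hearing.",
    {"NbChars": 0.78, "LevSim": 0.62}))
# -> "<LevSim_0.6> <NbChars_0.8> The committee adjourned the hearing."
```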
Recent advances in technology have led to a rise in social media usage, which in turn has produced large amounts of user-generated data, including hateful and offensive speech. The language used on social media is often a combination of English and the native language of the region. In India, Hindi is used predominantly and is often code-switched with English, yielding the Hinglish (Hindi + English) language. Various approaches have been adopted in the past to classify code-mixed Hinglish hate speech using different machine learning and deep learning techniques. However, these techniques make use of recurrence and convolution mechanisms, which are computationally expensive and have high memory requirements. Past techniques also rely on complex data processing, making the existing approaches complicated and unsustainable as data changes. We propose a much simpler approach which is not only on par with these complex networks but also exceeds their performance by using subword tokenization algorithms such as BPE and Unigram together with multi-head attention-based techniques, achieving an accuracy of 87.41% and an F1 score of 0.851 on standard datasets. Efficient use of the BPE and Unigram algorithms helps handle the unconventional Hinglish vocabulary, making our technique simple, efficient, and sustainable for real-world use.
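For reference, both tokenizers named above can be trained with the sentencepiece library; a minimal sketch, where the corpus file, vocabulary size, and character coverage are placeholders:

```python
import sentencepiece as spm

# Train a BPE and a Unigram tokenizer on a (placeholder) Hinglish corpus,
# one sentence per line.
for model_type in ("bpe", "unigram"):
    spm.SentencePieceTrainer.train(
        input="hinglish_corpus.txt",            # placeholder corpus file
        model_prefix=f"hinglish_{model_type}",
        vocab_size=8000,                        # illustrative size
        model_type=model_type,
        character_coverage=1.0,
    )

sp = spm.SentencePieceProcessor(model_file="hinglish_bpe.model")
print(sp.encode("yeh movie bahut acchi thi", out_type=str))
```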
What are the units of text that we want to model? From bytes to multi-word expressions, text can be analyzed and generated at many granularities. Until recently, most natural language processing (NLP) models operated over words, treating them as discrete and atomic tokens, but starting with byte pair encoding (BPE), subword-based approaches have become dominant in many areas, enabling small vocabularies while still allowing for fast inference. Is the end of the road character-level modeling or byte-level processing? In this survey, we connect several lines of work by showing how hybrid approaches of words and characters, as well as subword-based approaches based on learned segmentation, have been proposed and evaluated. We conclude that there is not, and likely never will be, a silver-bullet singular solution for all applications, and that thinking seriously about tokenization remains important for many applications.
Much of the existing linguistic data in many of the world's languages is locked away in non-digitized books and documents. Optical character recognition (OCR) can be used to produce digitized text, and previous work has demonstrated the utility of neural post-correction methods that improve the results of general-purpose OCR systems on recognition of less-well-resourced languages. However, these methods rely on manually curated post-correction data, which are relatively scarce compared to the non-annotated raw images that need to be digitized. In this paper, we present a semi-supervised learning method that makes it possible to utilize these raw images to improve performance, specifically through the use of self-training, a technique where a model is iteratively trained on its own outputs. In addition, to enforce consistency in the recognized vocabulary, we introduce a lexically aware decoding method that augments the neural post-correction model with a count-based language model constructed from the recognized texts, implemented using weighted finite-state automata (WFSA) for efficient and effective decoding. Results on four endangered languages demonstrate the utility of the proposed method, with relative error reductions of 15-29%, where we find the combination of self-training and lexically aware decoding essential for achieving consistent improvements. Data and code are available at https://shrutirij.github.io/ocr-el/.
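Setting aside the WFSA machinery, the lexical-consistency idea can be sketched as rescoring the post-correction model's hypotheses with a count-based language model estimated from already-recognized text, so that decoding prefers words attested elsewhere in the corpus. The smoothing and interpolation weight below are assumptions:

```python
import math
from collections import Counter

class CountLM:
    """Add-alpha-smoothed unigram LM over previously recognized text."""
    def __init__(self, recognized_texts, alpha=1.0):
        self.counts = Counter(w for t in recognized_texts for w in t.split())
        self.total = sum(self.counts.values())
        self.alpha = alpha

    def logp(self, word):
        v = len(self.counts) + 1          # +1 slot for unseen words
        return math.log((self.counts[word] + self.alpha) /
                        (self.total + self.alpha * v))

def rescore(hypotheses, lm, lam=0.3):
    """hypotheses: list of (text, neural_logprob) from the corrector."""
    def score(h):
        text, neural = h
        return (1 - lam) * neural + lam * sum(lm.logp(w) for w in text.split())
    return max(hypotheses, key=score)

lm = CountLM(["nga hina aku", "hina te whare"])   # toy recognized corpus
print(rescore([("nga hina aku", -3.2), ("nga hena aku", -3.0)], lm))
```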
Cyrillic and Traditional Mongolian are the two main members of the Mongolian writing system. The Cyrillic-Traditional Mongolian Bidirectional Conversion (CTMBC) task includes two conversion processes: Cyrillic Mongolian to Traditional Mongolian (C2T) and Traditional Mongolian to Cyrillic Mongolian (T2C). Previous researchers adopted the traditional joint sequence model, since the CTMBC task is a natural sequence-to-sequence (Seq2Seq) modeling problem. Recent studies have shown that encoder-decoder models based on recurrent neural networks (RNN) and self-attention (the Transformer) bring significant improvements to machine translation between some major languages, such as Mandarin, English, and French. However, it remains an open question whether RNN and Transformer models can improve CTMBC quality. To answer this question, this paper investigates the utility of these two powerful techniques for the CTMBC task, combined with the agglutinative characteristics of the Mongolian language. We build encoder-decoder CTMBC models based on the RNN and the Transformer respectively and compare different network configurations in depth. The experimental results show that both the RNN and Transformer models outperform the traditional joint sequence model, with the Transformer achieving the best performance. Compared with the joint sequence baseline, the word error rate (WER) of the Transformer decreases by 5.72% for C2T and 5.06% for T2C.
Error correction techniques have been used to refine the output sentences from automatic speech recognition (ASR) models and achieve a lower word error rate (WER) than original ASR outputs. Previous works usually use a sequence-to-sequence model to correct an ASR output sentence autoregressively, which causes large latency and cannot be deployed in online ASR services. A straightforward solution to reduce latency, inspired by non-autoregressive (NAR) neural machine translation, is to use an NAR sequence generation model for ASR error correction, which, however, comes at the cost of significantly increased ASR error rate. In this paper, observing distinctive error patterns and correction operations (i.e., insertion, deletion, and substitution) in ASR, we propose FastCorrect, a novel NAR error correction model based on edit alignment. In training, FastCorrect aligns each source token from an ASR output sentence to the target tokens from the corresponding ground-truth sentence based on the edit distance between the source and target sentences, and extracts the number of target tokens corresponding to each source token during editing/correction, which is then used to train a length predictor and to adjust the source tokens to match the length of the target sentence for parallel generation. In inference, the token number predicted by the length predictor is used to adjust the source tokens for target sequence generation. Experiments on the public AISHELL-1 dataset and an internal industrial-scale ASR dataset show the effectiveness of FastCorrect for ASR error correction: 1) it speeds up the inference by 6-9 times and maintains the accuracy (8-14% WER reduction) compared with the autoregressive correction model; and 2) it outperforms the popular NAR models adopted in neural machine translation and text editing by a large margin.
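The edit-alignment step can be sketched with a standard Levenshtein DP plus a traceback that attributes each target token to a source token, yielding the per-token counts the length predictor is trained on (0 = delete, 1 = keep/substitute, >1 = insertion site). Tie-breaking here is arbitrary and differs from the released FastCorrect code:

```python
def align_counts(src, tgt):
    """Return, for each source token, how many target tokens it aligns to."""
    m, n = len(src), len(tgt)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if src[i - 1] == tgt[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # delete source token
                          d[i][j - 1] + 1,         # insert target token
                          d[i - 1][j - 1] + cost)  # keep / substitute
    # Trace back, attributing each target token to one source token.
    counts, i, j = [0] * m, m, n
    while i > 0 or j > 0:
        if (i > 0 and j > 0
                and d[i][j] == d[i - 1][j - 1] + (src[i - 1] != tgt[j - 1])):
            counts[i - 1] += 1; i -= 1; j -= 1     # keep / substitute
        elif j > 0 and d[i][j] == d[i][j - 1] + 1:
            counts[max(i - 1, 0)] += 1; j -= 1     # insertion next to token i
        else:
            i -= 1                                 # deletion: count stays 0
    return counts

print(align_counts("我 爱 北 京".split(), "我 爱 北 京 天 安 门".split()))
# -> [1, 1, 1, 4]: the last source token aligns to four target tokens
```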
Named entity recognition is an information extraction task that serves as a preprocessing step for other natural language processing tasks, such as machine translation, information retrieval, and question answering. Named entity recognition enables the identification of proper names as well as temporal and numerical expressions in open-domain text. For Semitic languages such as Arabic, Amharic, and Hebrew, the named entity recognition task is more challenging due to the heavily inflected structure of these languages. In this paper, we present an Amharic named entity recognition system based on a bidirectional long short-term memory network with a conditional random fields layer. We annotated a new Amharic named entity recognition dataset (8,070 sentences, with 182,691 tokens) and applied the Synthetic Minority Over-sampling Technique to our dataset to mitigate the imbalanced classification problem. Our named entity recognition system achieves an F1 score of 93%, which is a new state-of-the-art result for Amharic named entity recognition.
Mining causality from text is a complex and crucial natural language understanding task that corresponds to human cognition. Existing studies on this problem fall into two main categories: feature-engineering-based and neural-model-based methods. In this paper, we find that the former has incomplete coverage and inherent errors but provides prior knowledge, while the latter leverages contextual information but performs insufficient causal inference. To handle these limitations, we propose a novel causality detection model named MCDN, which explicitly models causal reasoning and, moreover, exploits the strengths of both kinds of methods. Specifically, we adopt multi-head self-attention to acquire semantic features at the word level and build an SCRN to infer causality at the segment level. To the best of our knowledge, this is the first time a relation network has been applied to causality tasks. The experimental results show that: 1) the proposed method performs prominently on causality detection; 2) further analysis manifests the effectiveness and robustness of MCDN.
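For the word-level component, a minimal NumPy sketch of multi-head self-attention is shown below; the dimensions are arbitrary and MCDN's segment-level SCRN is not reproduced:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_self_attention(X, Wq, Wk, Wv, Wo, n_heads):
    """X: (T, d) token representations; W*: (d, d) projection matrices."""
    T, d = X.shape
    dh = d // n_heads
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    heads = []
    for h in range(n_heads):
        q, k, v = (M[:, h * dh:(h + 1) * dh] for M in (Q, K, V))
        A = softmax(q @ k.T / np.sqrt(dh))   # (T, T) attention weights
        heads.append(A @ v)
    return np.concatenate(heads, axis=-1) @ Wo

rng = np.random.default_rng(0)
T, d, H = 5, 8, 2
X = rng.standard_normal((T, d))
Wq, Wk, Wv, Wo = (rng.standard_normal((d, d)) for _ in range(4))
print(multi_head_self_attention(X, Wq, Wk, Wv, Wo, H).shape)  # (5, 8)
```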
The task of Chinese Spelling Check (CSC) aims to detect and correct spelling errors that can be found in text. Manually annotating a high-quality dataset is expensive and time-consuming, so the scale of the training data is usually very small (e.g., SIGHAN15 contains only 2,339 samples for training); supervised-learning-based models therefore usually suffer from the data-sparsity limitation and over-fitting, especially in the era of big language models. In this paper, we are dedicated to investigating the \textbf{unsupervised} paradigm for the CSC problem, and we propose a framework named \textbf{uChecker} for unsupervised spelling error detection and correction. Masked pretrained language models such as BERT are introduced as the backbone model in view of their powerful language-diagnosis capability. Benefiting from the various and flexible masking operations, we propose a confusionset-guided masking strategy to fine-train the masked language model, further improving the performance of unsupervised detection and correction. Experimental results on standard datasets demonstrate the effectiveness of the proposed model uChecker in terms of character-level and sentence-level accuracy, precision, recall, and F1-measure on the spelling error detection and correction tasks, respectively.
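A toy sketch of confusionset-guided masking: during fine-training, a character is sometimes replaced by a confusable character from its confusion set instead of [MASK], so the masked LM learns to recover the correct character from corrupted context. The confusion entries and mixing probabilities are illustrative, not uChecker's exact recipe:

```python
import random

# Illustrative confusion set: characters often mistaken for each other.
CONFUSION = {"的": ["地", "得"], "在": ["再"], "做": ["作"]}

def confusion_mask(chars, p=0.15, rng=random):
    """Return (corrupted_input, labels); '-' marks untrained positions."""
    inputs, labels = [], []
    for c in chars:
        if rng.random() < p and c in CONFUSION:
            inputs.append(rng.choice(CONFUSION[c]))  # confusable substitute
            labels.append(c)                         # model must restore c
        elif rng.random() < p:
            inputs.append("[MASK]")                  # ordinary MLM masking
            labels.append(c)
        else:
            inputs.append(c)
            labels.append("-")
    return inputs, labels

print(confusion_mask(list("我在做作业"), p=0.5))
```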