Smart Paper Notes

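We present the results of the Dravidian-CodeMix shared task held at FIRE 2021, a track on sentiment analysis for Dravidian languages in code-mixed text. We describe the task, its organization, and the submitted systems. This shared task continues last year's Dravidian-CodeMix shared task, held at FIRE 2020. This year's tasks included code-mixing at both the intra-token and inter-token levels. In addition to Tamil and Malayalam, Kannada was also introduced. We received 22 systems for Tamil-English, 15 for Malayalam-English, and 15 for Kannada-English. The top systems for Tamil-English, Malayalam-English, and Kannada-English achieved weighted average F1 scores of 0.711, 0.804, and 0.630, respectively. In summary, the quality and quantity of the submissions show that there is great interest in Dravidian languages in code-mixed settings, and that the state of the art in this domain still needs further improvement.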
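The weighted average F1 score reported above can be illustrated with a minimal sketch. It assumes the usual definition (per-class F1 averaged with weights proportional to each class's share of the gold labels); the function name `weighted_f1` and the toy labels are hypothetical and are not taken from the shared task's data or evaluation script.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Support-weighted average of per-class F1 scores.

    Assumption: each class is weighted by its frequency in y_true,
    which is the common meaning of "weighted average F1".
    """
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equal in length")

    support = Counter(y_true)  # number of gold examples per class
    total = len(y_true)
    score = 0.0
    for label, n in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = n - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        score += (n / total) * f1  # weight by class support
    return score


# Toy example with hypothetical sentiment labels.
gold = ["pos", "pos", "neg", "neg"]
pred = ["pos", "neg", "neg", "neg"]
print(round(weighted_f1(gold, pred), 4))
```

Weighting by support means frequent classes dominate the score, which matters when sentiment classes are imbalanced.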


Adapting the Tesseract Open-Source OCR Engine for Tamil and Sinhala Legacy Fonts and Creating a Parallel Corpus for Tamil-Sinhala-English

Charangan Vasantharajan, Laksika Tharmalingam, Uthayasanker Thayasivam

Category: Natural Language Processing

2021-09-13

Most low-resource languages lack the resources needed to build even a substantial monolingual corpus. Text in these languages is often found in government proceedings, but mainly as Portable Document Format (PDF) files that use legacy fonts. Extracting text from these documents to create a monolingual corpus is challenging because legacy fonts and printer-friendly encodings are not optimized for text extraction. We therefore propose a simple, automatic, and novel approach that scales to Tamil, Sinhala, and English, and to large numbers of documents, and that also yields parallel corpora. Since Tamil and Sinhala are low-resource languages, we improved the performance of Tesseract by applying LSTM-based training on more than 20 legacy fonts to recognize printed characters in these languages. In particular, our model detects code-mixed text, numbers, and special characters in printed documents. On the test set, this approach reduces the character-level error rate of Tesseract from 6.03 to 2.61 for Tamil (a drop of 3.42 percentage points) and from 7.61 to 4.74 for Sinhala (a drop of 2.87 percentage points), and the word-level error rate from 39.68 to 20.61 for Tamil (a drop of 19.07 percentage points) and from 35.04 to 26.58 for Sinhala (a drop of 8.46 percentage points). Our newly created parallel corpus consists of 185.4k, 168.9k, and 181.04k sentences and 2.11M, 2.22M, and 2.33M words in Tamil, Sinhala, and English, respectively. This study shows that fine-tuning Tesseract models on multiple new fonts improves text recognition and enhances OCR performance. We have made the newly trained models and the source code for fine-tuning Tesseract freely available.
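The character-level and word-level error rates quoted above can be sketched as follows. This is a minimal illustration assuming the common definition (Levenshtein edit distance divided by reference length, expressed as a percentage); the abstract does not state the exact evaluation procedure, and all function names here are hypothetical. Note that Python strings are compared per Unicode code point, which may differ from how visible characters are counted for Tamil or Sinhala script.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or token lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(ref, hyp):
    """Edit distance as a percentage of the reference length."""
    if len(ref) == 0:
        raise ValueError("reference must be non-empty")
    return 100.0 * edit_distance(ref, hyp) / len(ref)


def cer(ref_text, hyp_text):
    """Character error rate: compares code point by code point."""
    return error_rate(ref_text, hyp_text)


def wer(ref_text, hyp_text):
    """Word error rate: compares whitespace-separated tokens."""
    return error_rate(ref_text.split(), hyp_text.split())


def absolute_and_relative_change(before, after):
    """Return (difference in points, change as a percentage of `before`)."""
    return after - before, 100.0 * (after - before) / before


# Tamil character error rate figures from the abstract: 6.03 before, 2.61 after.
abs_change, rel_change = absolute_and_relative_change(6.03, 2.61)
print(round(abs_change, 2), round(rel_change, 1))
```

For the Tamil character-level figures, the difference is 3.42 points, which is about 56.7% of the starting value, so it helps to state which of the two quantities a reported change refers to.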
