Smart Paper Notes

Named Entity Recognition in Indian court judgments

Prathamesh Kalamkar, Astha Agarwal, Aman Tiwari, Smita Gupta, Saurabh Karn, Vivek Raghavan

Categories: Natural Language Processing | Artificial Intelligence

2022-11-07

Identification of named entities from legal texts is an essential building block for developing other legal Artificial Intelligence applications. Named entities in legal texts are slightly different from, and more fine-grained than, commonly used named entities such as Person, Organization, and Location. In this paper, we introduce a new corpus of 46545 annotated legal named entities mapped to 14 legal entity types. A baseline model for extracting legal named entities from judgment text is also developed.

translated by Google Translate

Is Word Error Rate a good evaluation metric for Speech Recognition in Indic Languages?

Priyanshi Shah, Harveen Singh Chadha, Anirudh Gupta, Ankur Dhuriya, Neeraj Chhimwal, Rishabh Gaur, Vivek Raghavan

Categories: Natural Language Processing

2022-03-30

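We propose a new method for calculating error rates in Automatic Speech Recognition (ASR). The new metric targets languages that contain half characters and in which the same character can be written in different forms. We implement our methodology in Hindi, one of the main languages in the Indic context, and we believe the approach extends to other similar languages with large character sets. We call our metrics Alternate Word Error Rate (AWER) and Alternate Character Error Rate (ACER). We train our ASR models using wav2vec 2.0 (Baevski et al., 2020). Additionally, we use language models to improve model performance. Our results show a significant improvement in analyzing error rates at the word and character level, and the interpretability of the ASR system improves by up to 3% in AWER and 7% in ACER for Hindi. Our experiments suggest that in languages with complex pronunciation there are multiple ways of writing a word without changing its meaning. In such cases, AWER and ACER are more useful metrics than WER and CER. Furthermore, we open-source a new 21-hour benchmarking dataset for Hindi together with the new metric scripts.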
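
The idea behind an alternate-aware error rate can be illustrated with a minimal sketch. It is not the authors' metric script: the `ALTERNATES` table, the function names, and the choice to normalize both sides to one canonical spelling before scoring are all assumptions made for illustration. The two spellings used, हिंदी and हिन्दी, are both accepted ways of writing "Hindi".

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (words or characters)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if r == h else 1)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, substitution))
        prev = curr
    return prev[-1]


def error_rate(ref_tokens, hyp_tokens):
    """Edit distance normalised by the reference length."""
    if not ref_tokens:
        raise ValueError("reference must not be empty")
    return edit_distance(ref_tokens, hyp_tokens) / len(ref_tokens)


# Hypothetical alternates table: each variant spelling maps to a canonical form.
ALTERNATES = {"हिन्दी": "हिंदी"}


def wer(ref, hyp, alternates=None):
    """Word error rate; with `alternates`, variant spellings count as matches."""
    table = alternates or {}
    ref_words = [table.get(w, w) for w in ref.split()]
    hyp_words = [table.get(w, w) for w in hyp.split()]
    return error_rate(ref_words, hyp_words)


reference = "मुझे हिंदी आती है"
hypothesis = "मुझे हिन्दी आती है"
plain = wer(reference, hypothesis)                    # variant counted as an error
alternate_aware = wer(reference, hypothesis, ALTERNATES)  # variant accepted
```

Under these assumptions the plain score penalizes one of four words, while the alternate-aware score treats the two spellings as the same word. A character-level variant would pass `list(ref)` and `list(hyp)` to `error_rate` instead of word lists.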

Improving Speech Recognition for Indic Languages using Language Model

Ankur Dhuriya, Harveen Singh Chadha, Anirudh Gupta, Priyanshi Shah, Neeraj Chhimwal, Rishabh Gaur, Vivek Raghavan

Categories: Natural Language Processing

2022-03-30

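We study the effect of applying a language model (LM) to the output of Automatic Speech Recognition (ASR) systems for Indic languages. We fine-tune wav2vec 2.0 models for 18 Indic languages and adjust the results with language models trained on text derived from a variety of sources. Our findings show that the average Character Error Rate (CER) decreases by 28% and the average Word Error Rate (WER) decreases by 36% after decoding with an LM. We show that a large LM may not provide a substantial improvement over a diverse one. We also demonstrate that high-quality transcriptions can be obtained on domain-specific data without retraining the ASR model, and we show results for the biomedical domain.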
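
One common way to apply an LM to ASR output is to rescore an n-best list, which the sketch below illustrates. It is an assumption-laden toy, not the decoder used in the paper: the add-one-smoothed unigram LM, the weights `alpha` and `beta`, the acoustic scores, and the example sentences are all hypothetical.

```python
import math
from collections import Counter


def train_unigram_lm(corpus, smoothing=1.0):
    """Return a function giving smoothed unigram log-probabilities of a sentence."""
    counts = Counter(word for sentence in corpus for word in sentence.split())
    total = sum(counts.values())
    vocab = len(counts) + 1  # +1 reserves probability mass for unseen words

    def log_prob(sentence):
        return sum(
            math.log((counts[w] + smoothing) / (total + smoothing * vocab))
            for w in sentence.split()
        )

    return log_prob


def rescore(nbest, lm_log_prob, alpha=0.5, beta=0.0):
    """Pick the hypothesis maximising acoustic + alpha * LM + beta * word count."""
    def score(item):
        text, acoustic = item
        return acoustic + alpha * lm_log_prob(text) + beta * len(text.split())

    return max(nbest, key=score)[0]


# Hypothetical in-domain text and n-best list of (text, acoustic log-score).
domain_text = ["the patient has diabetes", "the patient has a fever"]
lm = train_unigram_lm(domain_text)
nbest = [
    ("the patient has die a bee tees", -4.0),
    ("the patient has diabetes", -4.2),
]
without_lm = rescore(nbest, lm, alpha=0.0)
with_lm = rescore(nbest, lm, alpha=0.5)
```

With `alpha=0.0` the acoustically preferred but nonsensical hypothesis wins; with the in-domain LM weighted in, the domain term is selected without touching the acoustic model, which mirrors the abstract's point about domain-specific text.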

Code Switched and Code Mixed Speech Recognition for Indic languages

Harveen Singh Chadha, Priyanshi Shah, Ankur Dhuriya, Neeraj Chhimwal, Anirudh Gupta, Vivek Raghavan

Categories: Natural Language Processing

2022-03-30

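Training multilingual automatic speech recognition (ASR) systems is challenging because acoustic and lexical information is typically language-specific. Training multilingual systems for Indic languages is even harder because of the lack of open-source datasets and of published results on different approaches. We compare the performance of an end-to-end multilingual speech recognition system with that of monolingual models conditioned on language identification (LID). The decoding information from a multilingual model is used for language identification and then combined with monolingual models to obtain a 50% improvement in WER across languages. We also propose a similar technique to address the code-switching problem and achieve a WER of 21.77 and 28.27 on Hindi-English and Bengali-English respectively. Our work discusses how transformer-based ASR, especially wav2vec 2.0, can be applied to develop multilingual ASR and code-switched ASR for Indic languages.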
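
The routing idea (a multilingual first pass identifies the language, then a monolingual model produces the final transcript) can be sketched as follows. The abstract does not say how the decoding information is turned into a language label, so the majority-script vote here is purely an assumption, as are the function names and the stub models. Mapping Devanagari straight to `"hi"` is also a simplification, since several languages share that script.

```python
from collections import Counter

# Unicode block ranges for a few Indic scripts (inclusive code points).
SCRIPT_RANGES = {
    "hi": (0x0900, 0x097F),  # Devanagari (assumed to mean Hindi here)
    "bn": (0x0980, 0x09FF),  # Bengali
    "ta": (0x0B80, 0x0BFF),  # Tamil
}


def identify_language(multilingual_transcript):
    """Guess the language from the majority script of the decoded characters."""
    votes = Counter()
    for ch in multilingual_transcript:
        for lang, (low, high) in SCRIPT_RANGES.items():
            if low <= ord(ch) <= high:
                votes[lang] += 1
    if not votes:
        return None
    return votes.most_common(1)[0][0]


def transcribe(audio, multilingual_model, monolingual_models):
    """Route audio to a monolingual model chosen from the first-pass output."""
    first_pass = multilingual_model(audio)
    lang = identify_language(first_pass)
    model = monolingual_models.get(lang)
    # Fall back to the multilingual output when no monolingual model is available.
    return (lang, model(audio)) if model else (lang, first_pass)


# Stub models standing in for real ASR systems (hypothetical).
multilingual = lambda audio: "নমস্কার"
monolingual = {"bn": lambda audio: "নমস্কার বন্ধু"}
result = transcribe(b"", multilingual, monolingual)
```

In a real system the stubs would be replaced by actual model inference calls, and the language decision could come from model scores rather than from the script of the output.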

Vakyansh: ASR Toolkit for Low Resource Indic languages

Harveen Singh Chadha, Anirudh Gupta, Priyanshi Shah, Neeraj Chhimwal, Ankur Dhuriya, Rishabh Gaur, Vivek Raghavan

Categories: Natural Language Processing

2022-03-30

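We present Vakyansh, an end-to-end toolkit for speech recognition in Indic languages. India is home to almost 121 languages and around 125 crore (1.25 billion) speakers. Yet most of these languages are low-resource in terms of data and pretrained models. Through Vakyansh, we introduce automatic data pipelines for data creation, model training, model evaluation, and deployment. We create 14,000 hours of speech data in 23 Indic languages and train wav2vec 2.0 based pretrained models. These pretrained models are then fine-tuned to create state-of-the-art speech recognition models for 18 Indic languages, which are followed by language models and punctuation restoration models. We open-source all these resources with the mission of inspiring the speech community to develop speech-first applications in Indic languages using our ASR models.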

Corpus for Automatic Structuring of Legal Documents

Prathamesh Kalamkar, Aman Tiwari, Astha Agarwal, Saurabh Karn, Smita Gupta, Vivek Raghavan, Ashutosh Modi

Categories: Natural Language Processing | Artificial Intelligence | Machine Learning

2022-01-31

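In populous countries, the number of pending legal cases has been growing exponentially. There is a need to develop techniques for processing and organizing legal documents. In this paper, we introduce a new corpus for structuring legal documents. In particular, we introduce a corpus of legal judgment documents in English that are segmented into topical and coherent parts. Each of these parts is annotated with a label drawn from a list of predefined rhetorical roles. We develop baseline models for automatically predicting rhetorical roles in a legal document based on the annotated corpus. Furthermore, we show how rhetorical roles can be applied to improve performance on summarization and legal judgment prediction tasks. We release the corpus and the baseline model code along with the paper.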

CLSRIL-23: Cross Lingual Speech Representations for Indic Languages

Anirudh Gupta, Harveen Singh Chadha, Priyanshi Shah, Neeraj Chhimwal, Ankur Dhuriya, Rishabh Gaur, Vivek Raghavan

Categories: Natural Language Processing | Machine Learning

2021-07-15

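We present CLSRIL-23, a self-supervised audio pretrained model that learns cross-lingual speech representations from raw audio across 23 Indic languages. It is built on top of wav2vec 2.0, which is trained by solving a contrastive task over masked latent speech representations and jointly learns a quantization of the latents shared across all languages. We compare the language-wise loss during pretraining to assess the effects of monolingual and multilingual pretraining. Performance on some downstream fine-tuning tasks is also compared, and our experiments show that multilingual pretraining outperforms monolingual pretraining, both in learning speech representations and in performance on downstream tasks. When a multilingual pretrained model is used for fine-tuning in Hindi, a decrease of 5% in WER and 9.5% in CER is observed. All the code and models are also open-sourced. CLSRIL-23 is a model trained on 23 languages and almost 10,000 hours of audio data to facilitate research in speech recognition for Indic languages. We hope that new state-of-the-art systems will be created using the self-supervised approach, especially for low-resource Indic languages.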

Samanantar: The Largest Publicly Available Parallel Corpora Collection for 11 Indic Languages

Gowtham Ramesh, Sumanth Doddapaneni, Aravinth Bheemaraj, Mayank Jobanputra, Raghavan AK, Ajitesh Sharma, Sujit Sahoo, Harshita Diddee, Mahalakshmi J, Divyanshu Kakwani

Categories: Natural Language Processing

2021-04-12

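We present Samanantar, the largest publicly available parallel corpora collection for Indic languages. The collection contains a total of 49.7 million sentence pairs between English and 11 Indic languages (from two language families). Specifically, we compile 12.4 million sentence pairs from existing, publicly available parallel corpora and additionally mine 37.4 million sentence pairs from the web, resulting in a 4x increase. We mine parallel sentences from the web by combining many corpora, tools, and methods: (a) web-crawled monolingual corpora, (b) document OCR for extracting sentences from scanned documents, (c) multilingual representation models for aligning sentences, and (d) approximate nearest neighbor search for searching in a large collection of sentences. Human evaluation of samples from the newly mined corpora validates the high quality of the parallel sentences across the 11 languages. Further, using English as the pivot language, we extract 83.4 million sentence pairs between all 55 Indic language pairs from the English-centric parallel corpus. We train multilingual NMT models spanning all these languages on Samanantar; they outperform existing models and baselines on publicly available benchmarks such as FLORES, establishing the utility of Samanantar. Our data and models are publicly available at https://indicnlp.ai4bharat.org/samanantar/ and we hope they will help advance research in NMT and multilingual NLP.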
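
The pivoting step can be illustrated with a small sketch: two English-centric corpora are joined on identical English sentences to yield pairs between the two Indic languages. This is a simplified illustration under the assumption of exact string matches on the English side; the function name and toy sentences are hypothetical, and the paper's actual pipeline may differ. The count of 55 is simply the number of unordered pairs among 11 languages.

```python
from collections import defaultdict
from itertools import combinations


def pivot_pairs(en_to_x, en_to_y):
    """Join two English-centric corpora on identical English sentences."""
    by_english = defaultdict(list)
    for en, x in en_to_x:
        by_english[en].append(x)
    # Every X translation of an English sentence pairs with every Y translation.
    return [(x, y) for en, y in en_to_y for x in by_english.get(en, [])]


# Toy English-Hindi and English-Tamil corpora (hypothetical examples).
en_hi = [("Hello", "नमस्ते"), ("Thank you", "धन्यवाद")]
en_ta = [("Thank you", "நன்றி"), ("Good night", "இனிய இரவு")]
hi_ta = pivot_pairs(en_hi, en_ta)

# Number of unordered language pairs among 11 languages.
language_pairs = len(list(combinations(range(11), 2)))
```

Only "Thank you" appears in both toy corpora, so a single Hindi-Tamil pair is produced.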

Large Language Models Encode Clinical Knowledge

Karan Singhal, Shekoofeh Azizi, Tao Tu, S. Sara Mahdavi, Jason Wei, Hyung Won Chung, Nathan Scales, Ajay Tanwani, Heather Cole-Lewis, Stephen Pfohl

Categories: Natural Language Processing

2022-12-26

Large language models (LLMs) have demonstrated impressive capabilities in natural language understanding and generation, but the quality bar for medical and clinical applications is high. Today, attempts to assess models' clinical knowledge typically rely on automated evaluations on limited benchmarks. There is no standard to evaluate model predictions and reasoning across a breadth of tasks. To address this, we present MultiMedQA, a benchmark combining six existing open question answering datasets spanning professional medical exams, research, and consumer queries; and HealthSearchQA, a new free-response dataset of medical questions searched online. We propose a framework for human evaluation of model answers along multiple axes including factuality, precision, possible harm, and bias. In addition, we evaluate PaLM (a 540-billion parameter LLM) and its instruction-tuned variant, Flan-PaLM, on MultiMedQA. Using a combination of prompting strategies, Flan-PaLM achieves state-of-the-art accuracy on every MultiMedQA multiple-choice dataset (MedQA, MedMCQA, PubMedQA, MMLU clinical topics), including 67.6% accuracy on MedQA (US Medical License Exam questions), surpassing prior state-of-the-art by over 17%. However, human evaluation reveals key gaps in Flan-PaLM responses. To resolve this we introduce instruction prompt tuning, a parameter-efficient approach for aligning LLMs to new domains using a few exemplars. The resulting model, Med-PaLM, performs encouragingly, but remains inferior to clinicians. We show that comprehension, recall of knowledge, and medical reasoning improve with model scale and instruction prompt tuning, suggesting the potential utility of LLMs in medicine. Our human evaluations reveal important limitations of today's models, reinforcing the importance of both evaluation frameworks and method development in creating safe, helpful LLM models for clinical applications.

Bias Mitigation Framework for Intersectional Subgroups in Neural Networks

Narine Kokhlikyan, Bilal Alsallakh, Fulton Wang, Vivek Miglani, Oliver Aobo Yang, David Adkins

Categories: Machine Learning

2022-12-26

We propose a fairness-aware learning framework that mitigates intersectional subgroup bias associated with protected attributes. Prior research has primarily focused on mitigating one kind of bias by incorporating complex fairness-driven constraints into optimization objectives or designing additional layers that focus on specific protected attributes. We introduce a simple and generic bias mitigation approach that prevents models from learning relationships between protected attributes and output variable by reducing mutual information between them. We demonstrate that our approach is effective in reducing bias with little or no drop in accuracy. We also show that the models trained with our learning framework become causally fair and insensitive to the values of protected attributes. Finally, we validate our approach by studying feature interactions between protected and non-protected attributes. We demonstrate that these interactions are significantly reduced when applying our bias mitigation.
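
The quantity the approach reduces, mutual information between a protected attribute and the model output, can be measured empirically for discrete values as in the sketch below. This only estimates the quantity after the fact; it is not the authors' training procedure, and the function name and toy data are assumptions.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Empirical mutual information (in bits) between two discrete sequences."""
    if len(xs) != len(ys) or not xs:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px = Counter(xs)
    py = Counter(ys)
    # p(x, y) / (p(x) p(y)) simplifies to count * n / (count_x * count_y).
    return sum(
        (c / n) * math.log2(c * n / (px[x] * py[y]))
        for (x, y), c in joint.items()
    )


protected = [0, 0, 1, 1]           # hypothetical protected attribute values
biased_predictions = [0, 0, 1, 1]  # output fully determined by the attribute
fair_predictions = [0, 1, 0, 1]    # output independent of the attribute

mi_biased = mutual_information(protected, biased_predictions)
mi_fair = mutual_information(protected, fair_predictions)
```

In this toy case the biased predictions share one full bit of information with the protected attribute, while the independent predictions share none.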
