Smart Paper Notes

Detecting Logical Relation In Contract Clauses

Alexandre Yukio Ichida , Felipe Meneguzzi

Categories: Artificial Intelligence

2021-11-02

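Contracts underlie most modern commercial transactions, defining the duties and obligations of the related parties in an agreement. Ensuring that such agreements are error-free is crucial for modern society, and analyzing a contract requires understanding the logical relations between clauses and identifying potential contradictions. This analysis depends on error-prone human effort to understand each contract clause. In this work, we develop an approach to automate the extraction of logical relations between clauses in a contract. We address this problem as a natural language inference task to detect the entailment type between two clauses in a contract. The resulting approach should help contract authors detect potential logical conflicts between clauses.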

Translated by Google Translate

Developing neural machine translation models for Hungarian-English

Attila Nagy

Categories: Natural Language Processing | Machine Learning

2021-11-07

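I train models for the task of neural machine translation using the Hunglish2 corpus. The main contribution of this work is the evaluation of different data augmentation methods during the training of NMT models. I propose 5 different augmentation methods that are structure-aware, meaning that instead of randomly selecting words for blanking or replacement, the dependency tree of each sentence is used as the basis for augmentation. I start with a detailed literature review on neural networks, sequential modeling, neural machine translation, dependency parsing, and data augmentation. After a detailed exploratory data analysis and preprocessing of the Hunglish2 corpus, I perform experiments with the proposed data augmentation techniques. The best model for Hungarian-English achieves a BLEU score of 33.9, while the best model for English-Hungarian achieves a BLEU score of 28.6.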

Syntactic Inductive Biases for Deep Learning Methods

Yikang Shen

Categories: Machine Learning | Artificial Intelligence

2022-06-08

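In this thesis, we try to build a connection between the two schools by introducing syntactic inductive biases for deep learning models. We propose two families of inductive biases, one for constituency structure and another for dependency structure. The constituency inductive bias encourages deep learning models to use different units (or neurons) to separately process long-term and short-term information. This separation provides a way for deep learning models to build latent hierarchical representations from sequential inputs, in which a higher-level representation is composed of, and can be decomposed into, a series of lower-level representations. For example, without knowing the ground-truth structure, our proposed model learns to process logical expressions by composing representations of variables and operators according to their syntactic structure. On the other hand, the dependency inductive bias encourages models to find the latent relations between entities in the input sequence. For natural language, the latent relations are usually modeled as a directed dependency graph, where a word has exactly one parent node and zero or several child nodes. After applying this constraint to a Transformer-like model, we find that the model is capable of inducing directed graphs that are close to human expert annotations, and that it also outperforms the standard Transformer model on different tasks. We believe that these experimental results demonstrate an interesting alternative for the future development of deep learning models.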
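The dependency constraint described above (every word has exactly one parent, apart from a single root) can be illustrated with a small structural check. This is only a sketch of the constraint itself, not the thesis's model: the head-array encoding, the use of -1 for the root, and the function names are assumptions made for illustration.

```python
def is_valid_dependency_tree(heads):
    """Check a head array: heads[i] is the parent index of word i, -1 marks the root."""
    n = len(heads)
    # Exactly one word may be the root.
    if sum(1 for h in heads if h == -1) != 1:
        return False
    # Every non-root head must point at an existing word.
    if any(h != -1 and not (0 <= h < n) for h in heads):
        return False
    # Following parents from any word must reach the root without a cycle.
    for start in range(n):
        seen = set()
        node = start
        while node != -1:
            if node in seen:
                return False
            seen.add(node)
            node = heads[node]
    return True


def children(heads):
    """Invert the head array: each word maps to its (zero or several) children."""
    kids = {i: [] for i in range(len(heads))}
    for child, head in enumerate(heads):
        if head != -1:
            kids[head].append(child)
    return kids


# Hypothetical parse of "the cat sleeps": the -> cat, cat -> sleeps, sleeps is root.
example_heads = [1, 2, -1]
print(is_valid_dependency_tree(example_heads))
print(children(example_heads))
```

Storing one head per word makes the "exactly one parent" property hold by construction, so only the single-root and acyclicity conditions need explicit checks.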

A Survey of Natural Language Generation

Chenhe Dong , Yinghui Li , Haifan Gong , Miaoxin Chen , Junxin Li , Ying Shen , Min Yang

Categories: Natural Language Processing | Artificial Intelligence | Machine Learning

2021-12-22

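This paper offers a comprehensive review of research on natural language generation (NLG) over the past two decades, especially in relation to deep learning methods for data-to-text and text-to-text generation, as well as new applications of NLG technology. The survey aims to (a) give the latest synthesis of research on the core NLG tasks, as well as the architectures adopted in the field; (b) detail various NLG tasks and datasets, and draw attention to the challenges in NLG evaluation, focusing on different evaluation methods and their relationships; (c) highlight some future emphases and relatively recent research issues that arise from the increasing synergy between NLG and other areas of artificial intelligence, such as computer vision, text, and computational creativity.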

A Multi-level Neural Network for Implicit Causality Detection in Web Texts

Shining Liang , Wanli Zuo , Zhenkun Shi , Sen Wang , Junhu Wang , Xianglin Zuo

Categories: Natural Language Processing | Artificial Intelligence | Machine Learning

2019-08-18

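Mining causality from text is a complex and crucial natural language understanding task that corresponds to human cognition. Existing studies on its solution can be grouped into two primary categories: feature engineering based methods and neural model based methods. In this paper, we find that the former has incomplete coverage and inherent errors but provides prior knowledge, while the latter leverages context information but its causal inference is insufficient. To handle these limitations, we propose a novel causality detection model named MCDN that explicitly models causal reasoning and, furthermore, exploits the advantages of both methods. Specifically, we adopt multi-head self-attention to acquire semantic features at the word level and develop the SCRN to infer causality at the segment level. To the best of our knowledge, this is the first time that a relation network has been applied to causality tasks. The experimental results show that: 1) the proposed approach achieves prominent performance on causality detection; 2) further analysis demonstrates the effectiveness and robustness of MCDN.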

A large annotated corpus for learning natural language inference

Samuel R. Bowman , Gabor Angeli , Christopher Potts , Christopher D. Manning

Categories:

2015-08-21

Understanding entailment and contradiction is fundamental to understanding natural language, and inference about entailment and contradiction is a valuable testing ground for the development of semantic representations. However, machine learning research in this area has been dramatically limited by the lack of large-scale resources. To address this, we introduce the Stanford Natural Language Inference corpus, a new, freely available collection of labeled sentence pairs, written by humans doing a novel grounded task based on image captioning. At 570K pairs, it is two orders of magnitude larger than all other resources of its type. This increase in scale allows lexicalized classifiers to outperform some sophisticated existing entailment models, and it allows a neural network-based model to perform competitively on natural language inference benchmarks for the first time.

Paraphrase Identification with Deep Learning: A Review of Datasets and Methods

Chao Zhou , Cheng Qiu , Daniel E. Acuna

Categories: Natural Language Processing | Artificial Intelligence

2022-12-13

The rapid advancement of AI technology has made text generation tools like GPT-3 and ChatGPT increasingly accessible, scalable, and effective. This can pose a serious threat to the credibility of various forms of media, including scientific literature and news sources, if these technologies are used for plagiarism. Despite the development of automated methods for paraphrase identification, detecting this type of plagiarism remains a challenge due to the disparate nature of the datasets on which these methods are trained. In this study, we review traditional and current approaches to paraphrase identification and propose a refined typology of paraphrases. We also investigate how this typology is represented in popular datasets and how under-representation of certain types of paraphrases impacts detection capabilities. Finally, we outline new directions for future research and datasets in the pursuit of more effective paraphrase detection using AI.

Actuarial Applications of Natural Language Processing Using Transformers: Case Studies for Using Text Features in an Actuarial Context

Andreas Troxler , Jürg Schelldorfer

Categories: Natural Language Processing

2022-06-04

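This tutorial demonstrates workflows for incorporating text data into actuarial classification and regression tasks. The main focus is on methods that employ transformer-based models. A dataset of car accident descriptions with an average length of 400 words, available in English and German, and a dataset of short property insurance claim descriptions are used to demonstrate these techniques. The case studies tackle challenges related to a multilingual setting and long input sequences. They also show ways to interpret model output and to assess and improve model performance by fine-tuning the models to the domain of application or to a specific prediction task. Finally, the tutorial provides practical approaches for handling classification tasks in situations with no or only few labeled data. The results achieved by using the language-understanding skills of off-the-shelf natural language processing (NLP) models with only minimal pre-processing and fine-tuning clearly demonstrate the power of transfer learning for practical applications.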

Text classification in shipping industry using unsupervised models and Transformer based supervised models

Ying Xie , Dongping Song

Categories: Natural Language Processing | Machine Learning

2022-12-21

Obtaining labelled data in a particular context can be expensive and time consuming. Although different algorithms, including unsupervised learning, semi-supervised learning, and self-learning, have been adopted, the performance of text classification varies with context. Given the lack of a labelled dataset, we propose a novel and simple unsupervised text classification model to classify cargo content in the international shipping industry using the Standard International Trade Classification (SITC) codes. Our method stems from representing words using pretrained GloVe word embeddings and finding the most likely label using cosine similarity. To compare the unsupervised text classification model with supervised classification, we also applied several Transformer models to classify cargo content. Due to the lack of training data, the SITC numerical codes and the corresponding textual descriptions were used as training data. A small number of manually labelled cargo content records was used to evaluate the classification performance of the unsupervised classification and the Transformer based supervised classification. The comparison reveals that unsupervised classification significantly outperforms Transformer based supervised classification even after increasing the size of the training dataset by 30%. A lack of training data is a key bottleneck that prevents deep learning models (such as Transformers) from being applied successfully in practice. Unsupervised classification can provide an alternative, efficient, and effective method to classify text when training data is scarce.
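The idea of picking the most likely label by cosine similarity between word embeddings can be sketched as follows. This is a minimal illustration, not the authors' implementation: the three-dimensional vectors stand in for pretrained GloVe embeddings, the labels "food" and "metal" are hypothetical rather than real SITC codes, and averaging word vectors is an assumed way of representing a text.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mean_vector(words, embeddings):
    """Average the embeddings of the known words; None if no word is known."""
    vecs = [embeddings[w] for w in words if w in embeddings]
    if not vecs:
        return None
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(len(vecs[0]))]


def classify(text, label_descriptions, embeddings):
    """Return the label whose description is closest to the text in embedding space."""
    doc = mean_vector(text.lower().split(), embeddings)
    if doc is None:
        return None
    best_label, best_score = None, -2.0
    for label, description in label_descriptions.items():
        label_vec = mean_vector(description.lower().split(), embeddings)
        if label_vec is None:
            continue
        score = cosine(doc, label_vec)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy stand-ins for pretrained embeddings (assumption, not real GloVe values).
TOY_EMBEDDINGS = {
    "frozen": [0.9, 0.1, 0.0], "fish": [0.8, 0.2, 0.1], "meat": [0.7, 0.3, 0.0],
    "steel": [0.0, 0.1, 0.9], "iron": [0.1, 0.0, 0.8], "pipes": [0.1, 0.2, 0.9],
}
# Hypothetical label descriptions; real SITC descriptions would be used in practice.
TOY_LABELS = {"food": "fish meat", "metal": "iron steel"}

print(classify("frozen fish", TOY_LABELS, TOY_EMBEDDINGS))
print(classify("steel pipes", TOY_LABELS, TOY_EMBEDDINGS))
```

No labelled training examples are needed in this setup: only the label descriptions and the pretrained embeddings are used, which is what makes the approach unsupervised.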

A text autoencoder from transformer for fast encoding language representation

Tan Huang

Categories: Natural Language Processing | Artificial Intelligence

2021-11-04

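In recent years, BERT has shown clear advantages and great potential in natural language processing tasks. However, both training and applying BERT require intensive time and resources for computing contextual language representations, which hinders its universality and applicability. To overcome this bottleneck, we propose a deep bidirectional language model that uses a window masking mechanism at the attention layer. This work computes contextual language representations without the random masking used in BERT, while maintaining a deep bidirectional architecture like BERT. To compute the same sentence representation, our method shows O(n) complexity, compared with O(n^2) for other transformer-based models. To further demonstrate its superiority, contextual language representations are computed in a CPU environment; using the embeddings from the proposed method, logistic regression shows higher accuracy on SMS classification. Moreover, the proposed method also achieves significantly higher performance on semantic similarity tasks.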
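The gap between O(n) and O(n^2) can be made concrete by counting how many token pairs an attention layer has to score. The abstract does not spell out the window masking mechanism, so the sketch below assumes a generic fixed-radius sliding window; the function names and the radius parameter are illustrative only.

```python
def window_mask(n, radius):
    """Boolean n x n mask: position i may attend to j only if |i - j| <= radius."""
    return [[abs(i - j) <= radius for j in range(n)] for i in range(n)]


def attended_pairs(mask):
    """Number of (query, key) pairs that the mask allows."""
    return sum(sum(row) for row in mask)


for n in (8, 16, 32):
    windowed = attended_pairs(window_mask(n, radius=1))
    full = n * n  # unrestricted self-attention scores every pair
    print(n, windowed, full)
```

With a fixed radius the number of allowed pairs grows linearly in the sequence length (3n - 2 for radius 1), while full self-attention grows quadratically.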

Integrating Linguistic Theory and Neural Language Models

Bai Li

Categories: Natural Language Processing

2022-07-20

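Transformer-based language models have recently achieved remarkable results in many natural language tasks. However, leaderboard performance is generally achieved by leveraging massive amounts of training data, and rarely by encoding explicit linguistic knowledge into neural models. This has led many to question the relevance of linguistics for modern natural language processing. In this dissertation, I present several case studies to illustrate how theoretical linguistics and neural language models are still relevant to each other. First, language models are useful to linguists because they provide an objective tool for measuring semantic distance, which is difficult to do with traditional methods. On the other hand, linguistic theory contributes to language modelling research by providing frameworks and sources of data for probing language models on specific aspects of language understanding. This thesis contributes three studies that explore different aspects of the syntax-semantics interface in language models. In the first part of my thesis, I apply language models to the problem of word class flexibility. Using mBERT as a source of semantic distance measurements, I present evidence in favour of analyzing word class flexibility as a directional process. In the second part of my thesis, I propose a method to measure surprisal at intermediate layers of language models. My experiments show that sentences containing morphosyntactic anomalies trigger surprisal earlier in language models than semantic and commonsense anomalies do. Finally, in the third part of my thesis, I adapt several psycholinguistic studies to show that language models contain knowledge of argument structure constructions. In summary, my thesis develops new connections between natural language processing, linguistic theory, and psycholinguistics to provide fresh perspectives for the interpretation of language models.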

GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding

Alex Wang , Amanpreet Singh , Julian Michael , Felix Hill , Omer Levy , Samuel R. Bowman

Categories:

2018-04-20

For natural language understanding (NLU) technology to be maximally useful, it must be able to process language in a way that is not exclusive to a single task, genre, or dataset. In pursuit of this objective, we introduce the General Language Understanding Evaluation (GLUE) benchmark, a collection of tools for evaluating the performance of models across a diverse set of existing NLU tasks. By including tasks with limited training data, GLUE is designed to favor and encourage models that share general linguistic knowledge across tasks. GLUE also includes a hand-crafted diagnostic test suite that enables detailed linguistic analysis of models. We evaluate baselines based on current methods for transfer and representation learning and find that multi-task training on all tasks performs better than training a separate model per task. However, the low absolute performance of our best model indicates the need for improved general NLU systems.

Effective General-Domain Data Inclusion for the Machine Translation Task by Vanilla Transformers

Hassan Soliman

Categories: Natural Language Processing

2022-09-28

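One of the vital breakthroughs in the history of machine translation is the development of the Transformer model. It is revolutionary not only for various translation tasks but also for most other NLP tasks. In this paper, we target a Transformer-based system that is able to translate a source sentence in German into its counterpart target sentence in English. We perform experiments on the news commentary German-English parallel sentences from the WMT'13 dataset. In addition, we investigate the effect of including additional general-domain data from the IWSLT'16 dataset in training to improve the performance of the Transformer model. We find that including the IWSLT'16 dataset in training helps achieve a gain of 2 BLEU score points on the test set of the WMT'13 dataset. A qualitative analysis is introduced to examine how the use of general-domain data helps improve the quality of the produced translations.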

MICE: Mining Idioms with Contextual Embeddings

Tadej Škvorc , Polona Gantar , Marko Robnik-Šikonja

Categories: Natural Language Processing | Machine Learning

2020-08-13

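Idiomatic expressions can be problematic for natural language processing applications because their meaning cannot be inferred from their constituent words. A lack of successful methodological approaches and of sufficiently large datasets has prevented the development of machine learning approaches for detecting idioms, especially for expressions that do not occur in the training set. We present an approach, called MICE, that uses contextual embeddings for this purpose. We present a new dataset of multi-word expressions with literal and idiomatic meanings and use it to train a classifier based on two state-of-the-art contextual word embeddings: ELMo and BERT. We show that deep neural networks using both embeddings perform better than existing approaches and are capable of detecting idiomatic word use, even for expressions that were not present in the training set. We demonstrate cross-lingual transfer of the developed models and analyze the size of the required dataset.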

ArNLI: Arabic Natural Language Inference for Entailment and Contradiction Detection

Khloud Al Jallad , Nada Ghneim

Categories: Natural Language Processing | Artificial Intelligence | Machine Learning

2022-09-28

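Natural language inference (NLI) is a hot research topic in natural language processing, and contradiction detection between sentences is a special case of NLI. It is considered a difficult NLP task that has a large impact when added as a component in many NLP applications, such as question answering systems and text summarization. Arabic is one of the most challenging low-resource languages for contradiction detection because of its rich lexical and semantic ambiguity. We have created a dataset of more than 12k sentences, named ArNLI, that will be publicly available. Moreover, we have applied a new model inspired by the solutions for contradiction detection that Stanford proposed for English. We propose an approach to detect contradictions between pairs of sentences in Arabic using a contradiction vector combined with a language model vector as input to a machine learning model. We analyzed the results of different traditional machine learning classifiers and compared their results on our created dataset (ArNLI) and on automatic translations of the English PHEME and SICK datasets. The best results were achieved using a Random Forest classifier, with accuracies of 99%, 60%, and 75% on PHEME, SICK, and ArNLI respectively.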

Paying Attention to Astronomical Transients: Introducing the Time-series Transformer for Photometric Classification

Tarek Allam Jr. , Jason D. McEwen

Categories: Machine Learning

2021-05-13

Future surveys such as the Legacy Survey of Space and Time (LSST) of the Vera C. Rubin Observatory will observe an order of magnitude more astrophysical transient events than any previous survey. With this deluge of photometric data, it will be impossible for all such events to be classified by humans alone. Recent efforts have sought to leverage machine learning methods to tackle the challenge of astronomical transient classification, with ever improving success. Transformers are a recently developed deep learning architecture, first proposed for natural language processing, that have shown a great deal of recent success. In this work we develop a new transformer architecture, which uses multi-head self attention at its core, for general multi-variate time-series data. Furthermore, the proposed time-series transformer architecture supports the inclusion of an arbitrary number of additional features, while also offering interpretability. We apply the time-series transformer to the task of photometric classification, minimising the reliance on expert domain knowledge for feature selection, while achieving results comparable to state-of-the-art photometric classification methods. We achieve a logarithmic-loss of 0.507 on imbalanced data in a representative setting using data from the Photometric LSST Astronomical Time-Series Classification Challenge (PLAsTiCC). Moreover, we achieve a micro-averaged receiver operating characteristic area under curve of 0.98 and micro-averaged precision-recall area under curve of 0.87.

TNT-KID: Transformer-based Neural Tagger for Keyword Identification

Matej Martinc , Blaž Škrlj , Senja Pollak

Categories: Natural Language Processing

2020-03-20

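With growing amounts of available textual data, the development of algorithms capable of automatically analyzing, categorizing, and summarizing these data has become a necessity. In this research, we present a novel algorithm for keyword identification, i.e., the extraction of one-word or multi-word phrases representing key aspects of a given document, called Transformer-based Neural Tagger for Keyword Identification (TNT-KID). By adapting the transformer architecture to the specific task at hand and leveraging language model pretraining on a domain-specific corpus, the model is capable of overcoming the deficiencies of both supervised and unsupervised state-of-the-art approaches by offering competitive and robust performance on a variety of different datasets, while requiring only a fraction of the manually labeled data required by the best performing systems. This study also offers a thorough error analysis with valuable insights into the inner workings of the model, and an ablation study measuring the influence of specific components of the keyword identification workflow on overall performance.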

Deep Learning Driven Natural Languages Text to SQL Query Conversion: A Survey

Ayush Kumar , Parth Nagarkar , Prabhav Nalhe , Sanjeev Vijayakumar

Categories: Natural Language Processing | Artificial Intelligence

2022-08-08

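With the future striving toward data-centric decision-making, seamless access to databases is of utmost importance. There is extensive research on creating efficient text-to-SQL (Text2SQL) models for accessing data in databases. Natural language is one of the best interfaces for bridging the gap between data and results by accessing databases efficiently, especially for non-technical users. It will open doors and create tremendous interest among users, whether they are well versed in technical skills or not very skilled in query languages. Even though numerous deep learning based algorithms have been proposed or studied, it remains very challenging to use natural language to solve data query problems in real-world work scenarios. The reason is the use of different datasets in different studies, each of which comes with its own limitations and assumptions. At the same time, we lack a thorough understanding of these proposed models and of their limitations with respect to the specific datasets they are trained on. In this paper, we try to present a holistic overview of 24 neural network models studied in the last few years, including their architectures involving convolutional neural networks, recurrent neural networks, pointer networks, reinforcement learning, generative models, and so on. We also give an overview of 11 datasets that are widely used to train models for Text2SQL technologies. Finally, we discuss the future application possibilities of Text2SQL technologies for seamless data queries.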

Unsupervised Law Article Mining based on Deep Pre-Trained Language Representation Models with Application to the Italian Civil Code

Andrea Tagarelli , Andrea Simeri

Categories: Natural Language Processing | Artificial Intelligence

2021-12-02

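Modeling law search and retrieval as prediction problems has recently emerged as a predominant approach in law intelligence. Focusing on the law article retrieval task, we present a deep learning framework named LamBERTa, which is designed for civil-law codes and specifically trained on the Italian civil code. To our knowledge, this is the first study proposing an advanced approach to law article prediction for the Italian legal system based on a BERT (Bidirectional Encoder Representations from Transformers) learning framework, which has recently attracted increased attention among deep learning approaches and shown outstanding effectiveness in several natural language processing and learning tasks. We define LamBERTa models by fine-tuning an Italian pre-trained BERT on the Italian civil code or its portions, treating law article retrieval as a classification task. One key aspect of our LamBERTa framework is that we conceived it to address an extreme classification scenario, which is characterized by a high number of classes, the few-shot learning problem, and the lack of test query benchmarks for Italian legal prediction tasks. To solve these issues, we define different methods for the unsupervised labeling of law articles, which can in principle be applied to any legal system. We provide insights into the explainability and interpretability of our LamBERTa models, and we present an extensive experimental analysis over query sets for single-label as well as multi-label evaluation tasks. Empirical evidence has shown the effectiveness of LamBERTa, as well as its superiority over widely used deep learning text classifiers and a few-shot learner conceived for an attribute-aware prediction task.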

Enhanced word embeddings using multi-semantic representation through lexical chains

Terry Ruas , Charles Henrique Porto Ferreira , William Grosky , Fabrício Olivetti de França , Débora Maria Rossi Medeiros

Categories: Natural Language Processing | Machine Learning

2021-01-22

The relationship between words in a sentence often tells us more about the underlying semantic content of a document than its actual words, individually. In this work, we propose two novel algorithms, called Flexible Lexical Chain II and Fixed Lexical Chain II. These algorithms combine the semantic relations derived from lexical chains, prior knowledge from lexical databases, and the robustness of the distributional hypothesis in word embeddings as building blocks forming a single system. In short, our approach has three main contributions: (i) a set of techniques that fully integrate word embeddings and lexical chains; (ii) a more robust semantic representation that considers the latent relation between words in a document; and (iii) lightweight word embeddings models that can be extended to any natural language task. We intend to assess the knowledge of pre-trained models to evaluate their robustness in the document classification task. The proposed techniques are tested against seven word embeddings algorithms using five different machine learning classifiers over six scenarios in the document classification task. Our results show the integration between lexical chains and word embeddings representations sustain state-of-the-art results, even against more complex systems.
