Smart Paper Notes

A Data-driven Latent Semantic Analysis for Automatic Text Summarization using LDA Topic Modelling

Daniel F. O. Onah , Elaine L. L. Pang , Mahmoud El-Haj

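Category: Machine Learning

2022-07-23

With the advent and popularity of big data mining and large-scale text analysis, automated text summarization has become prominent for extracting and retrieving important information from documents. This research investigates aspects of automatic text summarization from the perspectives of single and multiple documents. Summarization is the task of condensing a large text article into a short summarized version. For summarization purposes the text is reduced in size, but it retains the key information and the meaning of the original document. This study presents the Latent Dirichlet Allocation (LDA) approach, used to perform topic modelling on summarized medical science journal articles with topics related to genes and diseases. The pyLDAvis web-based interactive visualization tool was used to visualize the selected topics. The visualization provides an overall view of the main topics while allowing deeper meaning to be attributed to the prevalence of individual topics. This study presents a novel approach to summarizing single and multiple documents. The results suggest that, with an extractive summarization technique, terms are ranked purely by their probability of topic prevalence within the processed document. The pyLDAvis visualization illustrates the flexibility of exploring how the terms of each topic relate to the fitted LDA model. The topic modelling result shows prevalence within topics 1 and 2. This association reveals a similarity between the terms in topics 1 and 2 in this study. The efficacy of the LDA and extractive summarization methods was measured using Latent Semantic Analysis (LSA) and Recall-Oriented Understudy for Gisting Evaluation (ROUGE) metrics to evaluate the reliability and validity of the model.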

Translated by Google Translate

Automatic Related Work Generation: A Meta Study

Xiangci Li , Jessica Ouyang

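Category: Natural Language Processing

2022-01-06

Academic research is an exploratory activity that solves problems never resolved before. By this nature, every academic research work must perform a literature review to distinguish its novelties, which have not been addressed by prior work. In natural language processing, this literature review is usually conducted under the "Related Work" section. Given the rest of the research paper and a list of cited papers, the task of automatic related work generation aims to generate the "Related Work" section automatically. Although this task was proposed more than 10 years ago, it was not until recently that it came to be treated as a variant of the scientific multi-document summarization problem. Even today, however, the problems of automatic related work generation and citation text generation are not yet standardized. In this survey, we conduct a meta-study that compares the existing literature on related work generation from the perspectives of problem formulation, dataset collection, methodological approach, performance evaluation, and future prospects, to give the reader insight into the progress of state-of-the-art studies and into how future studies can be conducted. We also survey related fields of study that we suggest future work should consider integrating.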

Exploring Optimal Granularity for Extractive Summarization of Unstructured Health Records: Analysis of the Largest Multi-Institutional Archive of Health Records in Japan

Kenichiro Ando , Takashi Okumura , Mamoru Komachi , Hiromasa Horiguchi , Yuji Matsumoto

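Category: Natural Language Processing

2022-09-20

Automated summarization of clinical texts can reduce the burden on medical professionals. "Discharge summaries" are one promising application of summarization, because they can be generated from daily inpatient records. Our preliminary experiment suggests that 20-31% of the descriptions in discharge summaries overlap with the content of the inpatient records. However, it remains unclear how summaries should be generated from the unstructured source. To decompose the physician's summarization process, this study aimed to identify the optimal granularity for summarization. We first defined three types of summarization units with different granularities to compare the performance of discharge summary generation: whole sentences, clinical segments, and clauses. Clinical segments, which we defined in this study, are intended to express the smallest medically meaningful concepts. Obtaining clinical segments requires splitting the text automatically in the first stage of the pipeline. We therefore compared rule-based methods with a machine learning method, and the latter outperformed the former in the splitting task with an F1 score of 0.846. Next, we measured the accuracy of extractive summarization with the three types of units, based on the ROUGE-1 metric, on a multi-institutional national archive of health records in Japan. The measured accuracies of extractive summarization using whole sentences, clinical segments, and clauses were 31.91, 36.15, and 25.18, respectively. We found that clinical segments yielded higher accuracy than sentences and clauses. This result indicates that summarization of inpatient records demands finer granularity than sentence-oriented processing. Although we used only Japanese health records, the result can be interpreted as follows: physicians extract "concepts of medical significance" from patient records and recombine them...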
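
The ROUGE-1 metric mentioned in this abstract counts unigrams shared between a candidate summary and a reference. The following is a minimal sketch of that idea, not the evaluation script used in the study: the function name `rouge1`, the whitespace tokenization, and the English example strings are all assumptions made for illustration (Japanese text would need a proper tokenizer).

```python
from collections import Counter


def rouge1(candidate: str, reference: str) -> dict:
    """Unigram-overlap precision, recall and F1 over whitespace tokens."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Multiset intersection clips each word's count to the smaller of the two.
    overlap = sum((cand & ref).values())
    precision = overlap / sum(cand.values()) if cand else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical candidate and reference, for illustration only.
scores = rouge1(
    "patient admitted with chest pain",
    "the patient was admitted with acute chest pain",
)
print(scores)
```

Under this sketch, changing the summarization unit (sentence, clinical segment, or clause) only changes which text is concatenated into `candidate`; the scoring itself stays the same.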

Multi-document Summarization via Deep Learning Techniques: A Survey

Congbo Ma , Wei Emma Zhang , Mingyu Guo , Hu Wang , Quan Z. Sheng

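Category: Natural Language Processing | Machine Learning

2020-11-10

Multi-document summarization (MDS) is an effective tool for information aggregation that generates an informative and concise summary from a cluster of topic-related documents. Our survey is the first to systematically review recent deep-learning-based MDS models. We propose a novel taxonomy that summarizes the design strategies of neural networks, and we provide a comprehensive summary of the state of the art. We highlight the differences between various objective functions, which are rarely discussed in the existing literature. Finally, we propose several future directions for this new and exciting field.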

An Empirical Survey on Long Document Summarization: Datasets, Models and Metrics

Huan Yee Koh , Jiaxin Ju , Ming Liu , Shirui Pan

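Category: Natural Language Processing

2022-07-03

Long documents such as academic articles and business reports have been the standard format for detailing important issues and complicated subjects that require extra attention. An automatic summarization system that can effectively condense long documents into short, concise texts capturing the most important information would therefore be a significant aid to reader comprehension. Recently, with the advent of neural architectures, significant research effort has gone into advancing automatic text summarization systems, and numerous studies have examined the challenges of extending these systems to the long-document domain. In this survey, we provide a comprehensive overview of research on long document summarization and a systematic evaluation across the three principal components of its research setting: benchmark datasets, summarization models, and evaluation metrics. For each component, we organize the literature within the context of long document summarization and conduct an empirical analysis to broaden the perspective on current research progress. The empirical analysis includes a study of the intrinsic characteristics of benchmark datasets, a multi-dimensional analysis of summarization models, and a review of summarization evaluation metrics. Based on the overall findings, we conclude by proposing possible directions for future exploration in this rapidly growing field.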

Implementing Deep Learning-Based Approaches for Article Summarization in Indian Languages

Rahul Tangsali , Aabha Pingle , Aditya Vyawahare , Isha Joshi , Raviraj Joshi

Category: Natural Language Processing | Machine Learning

2022-12-12

Research on text summarization for low-resource Indian languages has been limited by the scarce availability of relevant datasets. This paper presents a summary of various deep-learning approaches used for the ILSUM 2022 Indic language summarization datasets. The ILSUM 2022 dataset consists of news articles written in Indian English, Hindi, and Gujarati, together with their ground-truth summaries. In our work, we explore different pre-trained seq2seq models and fine-tune them on the ILSUM 2022 datasets. In our case, the fine-tuned SoTA PEGASUS model worked best for English, the fine-tuned IndicBART model with augmented data worked best for Hindi, and a fine-tuned PEGASUS model combined with a translation mapping-based approach worked best for Gujarati. Our scores on the obtained inferences were evaluated using ROUGE-1, ROUGE-2, and ROUGE-4 as the evaluation metrics.

Adaptive Summaries: A Personalized Concept-based Summarization Approach by Learning from Users' Feedback

Samira Ghodratnama , Mehrdad Zakershahrak , Fariborz Sobhanmanesh

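Category: Artificial Intelligence

2020-12-24

Exploring a tremendous amount of data efficiently in order to make a decision, much like answering a complicated question, is challenging in many real-world application scenarios. In this context, automatic summarization is of substantial importance, as it provides the foundation for big data analytics. Traditional summarization approaches optimize the system to produce a short, static summary that fits all users and ignores the subjective aspect of summarization, i.e., what different users deem valuable, which makes these approaches impractical in real-world use cases. This paper proposes an interactive concept-based summarization model, called Adaptive Summaries, that helps users build their desired summary instead of producing a single inflexible one. The system learns gradually from the information users provide as they interact with it, giving feedback in an iterative loop. Users can accept or reject a concept for inclusion in the summary, and indicate both the importance of that concept from their perspective and their confidence in the feedback. The proposed approach guarantees interactive speed, which keeps users engaged in the process. Furthermore, it eliminates the need for reference summaries, which is a challenging issue for summarization tasks. Evaluations show that Adaptive Summaries helps users produce high-quality summaries based on their preferences by maximizing the user-desired content in the generated summaries.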

Indian Legal Text Summarization: A Text Normalisation-based Approach

Satyajit Ghosh , Mousumi Dutta , Tanaya Das

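Category: Natural Language Processing

2022-06-13

In the Indian court system, the case backlog has long been a problem. More than 40 million cases are pending. Manually summarizing hundreds of documents is a time-consuming and tedious task for legal stakeholders. With the development of machine learning, many state-of-the-art models for text summarization have emerged. Domain-independent models do not perform well on legal texts, and fine-tuning them for the Indian legal system is problematic because publicly available datasets are lacking. To improve the performance of domain-independent models, the authors propose a method for normalizing legal texts in the Indian context. The authors experimented with two state-of-the-art domain-independent models for legal text summarization, namely BART and PEGASUS. BART and PEGASUS were tested on both extractive and abstractive summarization to assess the effectiveness of the text normalization approach. The summarized texts were evaluated by domain experts on multiple parameters and with ROUGE metrics. The results show that the proposed text normalization approach is effective on legal texts with domain-independent models.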

A Survey on Medical Document Summarization

Raghav Jain , Anubhav Jangra , Sriparna Saha , Adam Jatowt

Category: Natural Language Processing

2022-12-03

The internet has had a dramatic effect on the healthcare industry, allowing documents to be saved, shared, and managed digitally. This has made it easier to locate and share important data, improving patient care and providing more opportunities for medical studies. As there is so much data accessible to doctors and patients alike, summarizing it has become increasingly necessary. This has been supported through the introduction of deep learning and transformer-based networks, which have boosted the sector significantly in recent years. This paper gives a comprehensive survey of the current techniques and trends in medical summarization.

Computing and Exploiting Document Structure to Improve Unsupervised Extractive Summarization of Legal Case Decisions

Yang Zhong , Diane Litman

Category: Natural Language Processing

2022-11-06

Though many algorithms can be used to automatically summarize legal case decisions, most fail to incorporate domain knowledge about how important sentences in a legal decision relate to a representation of its document structure. For example, analysis of a legal case summarization dataset demonstrates that sentences serving different types of argumentative roles in the decision appear in different sections of the document. In this work, we propose an unsupervised graph-based ranking model that uses a reweighting algorithm to exploit properties of the document structure of legal case decisions. We also explore the impact of using different methods to compute the document structure. Results on the Canadian Legal Case Law dataset show that our proposed method outperforms several strong baselines.

Graph-based Semantical Extractive Text Analysis

Mina Samizadeh

Category: Natural Language Processing | Machine Learning

2022-12-19

In the past few decades, there has been an explosion in the amount of available data produced from various sources on different topics. The availability of this enormous amount of data requires us to adopt effective computational tools to explore it. This has led to intense and growing interest in the research community in developing computational methods for processing this text data. One line of study focuses on condensing text so that we can reach a higher level of understanding in a shorter time. The two important tasks here are keyword extraction and text summarization. In keyword extraction, we are interested in finding the key words of a text, which familiarizes us with its general topic. In text summarization, we are interested in producing a short text that includes the important information in the document. The TextRank algorithm, an unsupervised learning method that extends PageRank (the base algorithm the Google search engine uses for searching and ranking pages), has shown its efficacy in large-scale text mining, especially for text summarization and keyword extraction. This algorithm can automatically extract the important parts of a text (keywords or sentences) and declare them as the result. However, it neglects the semantic similarity between the different parts. In this work, we improved the results of the TextRank algorithm by incorporating the semantic similarity between parts of the text. Aside from keyword extraction and text summarization, we develop a topic clustering algorithm based on our framework, which can be used on its own or as part of generating the summary to overcome coverage problems.

WikiDes: A Wikipedia-Based Dataset for Generating Short Descriptions from Paragraphs

Hoang Thang Ta , Abu Bakar Siddiqur Rahman , Navonil Majumder , Amir Hussain , Lotfollah Najjar , Newton Howard , Soujanya Poria , Alexander Gelbukh

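Category: Natural Language Processing

2022-09-27

As free online encyclopedias with massive volumes of content, Wikipedia and Wikidata are key to many natural language processing (NLP) tasks, such as information retrieval, knowledge base building, machine translation, text classification, and text summarization. In this paper, we introduce WikiDes, a novel dataset for generating short descriptions of Wikipedia articles, framed as a text summarization problem. The dataset consists of 80K English samples on 6987 topics. We set up a two-phase summarization method, description generation (Phase I) followed by candidate ranking (Phase II), as a strong approach that relies on transfer learning and contrastive learning. For description generation, T5 and BART show their superiority over other small-scale pre-trained models. By applying contrastive learning to the diverse inputs from beam search, the metric-based ranking models outperform the direct description generation models by up to 22 ROUGE in both the topic-exclusive split and the topic-independent split. Furthermore, in human evaluation against the gold descriptions, the Phase II descriptions were chosen in over 45.33% of cases, compared with 23.66% for Phase I. In terms of sentiment analysis, the generated descriptions cannot effectively capture all sentiment polarities from the paragraphs, whereas the gold descriptions do this better. Automatically generating new descriptions reduces the human effort needed to create them and enriches Wikidata-based knowledge graphs. Our paper has a practical impact on Wikipedia and Wikidata, since thousands of descriptions are missing. Finally, we expect WikiDes to be a useful dataset for related work on capturing salient information from short paragraphs. The curated dataset is publicly available at: https://github.com/declare-lab/wikides.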

Transforming Wikipedia into Augmented Data for Query-Focused Summarization

Haichao Zhu , Li Dong , Furu Wei , Bing Qin , Ting Liu

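Category: Natural Language Processing

2019-11-08

The limited size of existing query-focused summarization datasets makes it challenging to train data-driven summarization models. Meanwhile, constructing a query-focused summarization corpus by hand is costly and time-consuming. In this paper, we use Wikipedia to automatically collect a large query-focused summarization dataset (named WikiRef) of more than 280,000 examples, which can serve as a means of data augmentation. We also develop a BERT-based query-focused summarization model (Q-BERT) that extracts sentences from documents as summaries. To better adapt a huge model containing millions of parameters, we identify and fine-tune only a sparse subnetwork, which corresponds to a small fraction of the full model parameters. Experimental results on three DUC benchmarks show that the model pre-trained on WikiRef already achieves reasonable performance. After fine-tuning on the specific benchmark datasets, the model with data augmentation outperforms strong comparison systems. Moreover, both our proposed Q-BERT model and subnetwork fine-tuning further improve model performance. The dataset is publicly available at https://aka.ms/wikiref.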

CLTS+: A New Chinese Long Text Summarization Dataset with Abstractive Summaries

Xiaojun Liu , Shunan Zang , Chuang Zhang , Xiaojun Chen , Yangyang Ding

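Category: Natural Language Processing | Artificial Intelligence

2022-06-09

The lack of creative ability in abstractive methods is a particular problem in automatic text summarization. The summaries that models generate are mostly extracted from the source articles. One of the main causes of this problem is the lack of datasets with abstractiveness, especially for Chinese. To solve this problem, we paraphrase the reference summaries in CLTS, the Chinese Long Text Summarization dataset, correct errors of factual inconsistency, and propose the first Chinese long text summarization dataset with a high level of abstractiveness, CLTS+, which contains more than 180K article-summary pairs and is available online. In addition, we introduce an intrinsic metric based on co-occurring words to evaluate the dataset we constructed. We analyze the extraction strategies used in CLTS+ summaries against those of other datasets to quantify the abstractiveness and difficulty of our new data, and we train several baselines on CLTS+ to verify its utility for improving the creative ability of models.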

Pointer over Attention: An Improved Bangla Text Summarization Approach Using Hybrid Pointer Generator Network

Nobel Dhar , Gaurob Saha , Prithwiraj Bhattacharjee , Avi Mallick , Md Saiful Islam

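Category: Natural Language Processing | Machine Learning

2021-11-19

Despite the success of neural sequence-to-sequence models for abstractive text summarization, they have a few shortcomings, such as reproducing factual details inaccurately and tending to repeat themselves. We propose a hybrid pointer generator network to address inaccurate reproduction of factual details and phrase repetition. We augment the attention-based sequence-to-sequence model with a hybrid pointer generator network that can generate out-of-vocabulary words and improve accuracy in reproducing factual details, together with a coverage mechanism that discourages repetition. It produces reasonable output text that preserves the conceptual integrity and factual information of the input article. For evaluation, we primarily employed "BANSData", a widely adopted, publicly available Bengali dataset. In addition, we prepared a large-scale dataset called "BANS-133", which consists of 133k Bangla news articles paired with human-generated summaries. With the proposed model, we achieved ROUGE-1 and ROUGE-2 scores of 0.66 and 0.41 on the "BANSData" dataset and 0.67 and 0.42 on the "BANS-133k" dataset. We demonstrate that the proposed system surpasses previous state-of-the-art Bengali abstractive summarization techniques and remains stable on a larger dataset. The "BANS-133" dataset and code base will be made publicly available for research.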
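
The copy mechanism that lets a pointer-generator model emit out-of-vocabulary words can be illustrated with the standard mixing step, in which a generation probability `p_gen` interpolates between the vocabulary distribution and the attention distribution over source tokens. The sketch below shows only that general formulation and is not the paper's hybrid network; the function name `final_distribution` and all numbers are hypothetical.

```python
def final_distribution(p_gen, vocab_probs, attention, source_tokens):
    """Mix generating and copying: p_gen * P_vocab(w) + (1 - p_gen) * attention on w."""
    mixed = {word: p_gen * prob for word, prob in vocab_probs.items()}
    for token, weight in zip(source_tokens, attention):
        # Copy mass accumulates over every source position holding the token,
        # so a source word missing from the vocabulary still gets probability.
        mixed[token] = mixed.get(token, 0.0) + (1.0 - p_gen) * weight
    return mixed


# Hypothetical toy values: "Sylhet" is absent from the vocabulary but present in the source.
dist = final_distribution(
    p_gen=0.6,
    vocab_probs={"the": 0.5, "flood": 0.3, "city": 0.2},
    attention=[0.7, 0.2, 0.1],
    source_tokens=["Sylhet", "flood", "city"],
)
print(dist)
```

Because both input distributions sum to one, the mixed distribution also sums to one, and the out-of-vocabulary source word receives its probability purely from the attention term.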

Comparing Methods for Extractive Summarization of Call Centre Dialogue

Alexandra N. Uma , Dmitry Sityaev

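Category: Natural Language Processing | Artificial Intelligence

2022-09-06

This paper reports the results of evaluating several text summarization techniques for producing call summaries in contact centre solutions. We focus specifically on extractive summarization methods, as they do not require any labelled data and are very easy to implement for production use. We compare several such methods experimentally by using them to produce summaries of calls, and we evaluate these summaries objectively (using ROUGE-L) and subjectively (by aggregating the judgements of several annotators). We found that TopicSum and Lead-N outperform the other summarization methods, whilst BERTSum received comparatively lower scores in both the subjective and the objective evaluation. The results demonstrate that even heuristics-based methods such as Lead-N can produce meaningful and useful summaries of call centre dialogues.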
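
The Lead-N heuristic named in this abstract simply keeps the first N units of the input as the summary. A minimal sketch follows, assuming sentences as the unit and a naive punctuation-based splitter; the function name `lead_n` and the sample call text are hypothetical, and the paper may segment calls differently (for example by speaker turn).

```python
import re


def lead_n(text: str, n: int = 3) -> str:
    """Return the first n sentences of the text as an extractive summary."""
    # Naive splitter: break on whitespace that follows '.', '!' or '?'.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    return " ".join(sentences[:n])


# Hypothetical call transcript, for illustration only.
call = (
    "Hello, thanks for calling. I need to reset my password. "
    "Sure, I can help with that. Please confirm your email."
)
summary = lead_n(call, n=2)
print(summary)
```

The baseline needs no training data or model, which matches the abstract's point that such heuristics are easy to deploy.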

MORTY: Structured Summarization for Targeted Information Extraction from Scholarly Articles

Mohamad Yaser Jaradeh , Markus Stocker , Sören Auer

Category: Natural Language Processing

2022-12-11

Information extraction from scholarly articles is a challenging task due to the sizable document length and implicit information hidden in text, figures, and citations. Scholarly information extraction has various applications in exploration, archival, and curation services for digital libraries and knowledge management systems. We present MORTY, an information extraction technique that creates structured summaries of text from scholarly articles. Our approach condenses the article's full-text to property-value pairs as a segmented text snippet called structured summary. We also present a sizable scholarly dataset combining structured summaries retrieved from a scholarly knowledge graph and corresponding publicly available scientific articles, which we openly publish as a resource for the research community. Our results show that structured summarization is a suitable approach for targeted information extraction that complements other commonly used methods such as question answering and named entity recognition.

Textrank: Bringing order into text

Category:

In this paper, we introduce TextRank, a graph-based ranking model for text processing, and show how this model can be successfully used in natural language applications. In particular, we propose two innovative unsupervised methods for keyword and sentence extraction, and show that the results obtained compare favorably with previously published results on established benchmarks.
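
Graph-based sentence ranking of this kind can be sketched as PageRank run over a sentence-similarity graph. The example below is a simplified illustration under stated assumptions, not a reference implementation: edge weights use shared-word counts normalized by the log of each sentence's unique-word count, the damping factor of 0.85 and the fixed iteration count are conventional choices, and the function name `textrank_sentences` and sample sentences are hypothetical.

```python
import math
import re


def textrank_sentences(sentences, damping=0.85, iterations=50):
    """Score sentences with PageRank over a word-overlap similarity graph."""
    tokens = [set(re.findall(r"\w+", s.lower())) for s in sentences]
    n = len(sentences)
    weights = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            # Require more than one unique word so both log terms are positive.
            if i != j and len(tokens[i]) > 1 and len(tokens[j]) > 1:
                shared = len(tokens[i] & tokens[j])
                norm = math.log(len(tokens[i])) + math.log(len(tokens[j]))
                weights[i][j] = shared / norm
    out_sum = [sum(row) for row in weights]
    scores = [1.0] * n
    for _ in range(iterations):
        # Each sentence collects score from its neighbours, in proportion to
        # the share of the neighbour's total edge weight that points at it.
        scores = [
            (1 - damping) + damping * sum(
                weights[j][i] / out_sum[j] * scores[j]
                for j in range(n) if out_sum[j] > 0
            )
            for i in range(n)
        ]
    return scores


# Hypothetical input: two related sentences and one unrelated sentence.
sentences = [
    "The cat sat on the mat.",
    "The cat chased a mouse.",
    "Stock prices fell sharply today.",
]
scores = textrank_sentences(sentences)
print(scores)
```

An extractive summary would then take the highest-scoring sentences; sentences with no lexical overlap with the rest of the text fall to the bottom of the ranking.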

Klexikon: A German Dataset for Joint Summarization and Simplification

Dennis Aumiller , Michael Gertz

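Category: Natural Language Processing

2022-01-18

Traditionally, text simplification is treated as a monolingual translation task in which sentences in source texts are aligned with their simplified counterparts. However, especially for longer input documents, summarizing the text (or dropping less relevant content altogether) plays an important role in the simplification process, and this is currently not reflected in existing datasets. At the same time, resources for non-English languages are generally scarce, which makes training new solutions prohibitive. To tackle this problem, we pose core requirements for a system that can jointly summarize and simplify long source documents. We further describe the creation of a new dataset for joint text simplification and summarization, based on German Wikipedia and the German children's lexicon "Klexikon", consisting of almost 2900 documents. We release a document-aligned version that particularly highlights the summarization aspect, and we provide statistical evidence that this resource is also well suited to simplification. Code and data are available on GitHub: https://github.com/dennlinger/klexikon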

CELLS: A Parallel Corpus for Biomedical Lay Language Generation

Yue Guo , Wei Qiu , Gondy Leroy , Sheng Wang , Trevor Cohen

Category: Natural Language Processing

2022-11-07

Recent lay language generation systems have used Transformer models trained on a parallel corpus to increase health information accessibility. However, the applicability of these models is constrained by the limited size and topical breadth of available corpora. We introduce CELLS, the largest (63k pairs) and broadest-ranging (12 journals) parallel corpus for lay language generation. The abstract and the corresponding lay language summary are written by domain experts, assuring the quality of our dataset. Furthermore, qualitative evaluation of expert-authored plain language summaries has revealed background explanation as a key strategy to increase accessibility. Such explanation is challenging for neural models to generate because it goes beyond simplification by adding content absent from the source. We derive two specialized paired corpora from CELLS to address key challenges in lay language generation: generating background explanations and simplifying the original abstract. We adopt retrieval-augmented models as an intuitive fit for the task of background explanation generation, and show improvements in summary quality and simplicity while maintaining factual correctness. Taken together, this work presents the first comprehensive study of background explanation for lay language generation, paving the path for disseminating scientific knowledge to a broader audience. CELLS is publicly available at: https://github.com/LinguisticAnomalies/pls_retrieval.
