The meaningful use of electronic health records (EHR) continues to progress in the digital era with clinical decision support systems augmented by artificial intelligence. A priority in improving provider experience is to overcome information overload and reduce the cognitive burden so fewer medical errors and cognitive biases are introduced during patient care. One major type of medical error is diagnostic error due to systematic or predictable errors in judgment that rely on heuristics. The potential for clinical natural language processing (cNLP) to model diagnostic reasoning in humans with forward reasoning from data to diagnosis and potentially reduce the cognitive burden and medical error has not been investigated. Existing tasks to advance the science in cNLP have largely focused on information extraction and named entity recognition through classification tasks. We introduce a novel suite of tasks coined as Diagnostic Reasoning Benchmarks, DR.BENCH, as a new benchmark for developing and evaluating cNLP models with clinical diagnostic reasoning ability. The suite includes six tasks from ten publicly available datasets addressing clinical text understanding, medical knowledge reasoning, and diagnosis generation. DR.BENCH is the first clinical suite of tasks designed to be a natural language generation framework to evaluate pre-trained language models. Experiments with state-of-the-art pre-trained generative language models using large general domain models and models that were continually trained on a medical corpus demonstrate opportunities for improvement when evaluated in DR. BENCH. We share DR. BENCH as a publicly available GitLab repository with a systematic approach to load and evaluate models for the cNLP community.
translated by 谷歌翻译
使用自然语言处理方法自动汇总患者的主要进度注释中的主要问题,有助于与医院环境中的信息和认知超负荷作斗争,并可能为提供者提供计算机化的诊断决策支持。问题列表摘要需要一个模型来理解,抽象和生成临床文档。在这项工作中,我们提出了一项新的NLP任务,旨在在住院期间使用提供者进度注释的意见来在患者的日常护理计划中生成一系列问题。我们研究了两个最先进的SEQ2SEQ变压器体系结构T5和Bart的性能,以解决此问题。我们提供了一个基于公开可用的电子健康记录进度注释MART MART(MIMIC)-III中的公开电子健康记录进度注释的语料库。 T5和BART对通用域文本进行了培训,我们尝试了数据增强方法和域适应性预训练方法,以增加医学词汇和知识的接触。评估方法包括胭脂,Bertscore,嵌入句子上的余弦相似性以及对医学概念的F评分。结果表明,与基于规则的系统和通用域预训练的语言模型相比,具有领域自适应预训练的T5可实现显着的性能增长,这表明可以解决问题摘要任务的有希望的方向。
translated by 谷歌翻译
自动问题应答(QA)系统的目的是以时间有效的方式向用户查询提供答案。通常在数据库(或知识库)或通常被称为语料库的文件集合中找到答案。在过去的几十年里,收购知识的扩散,因此生物医学领域的新科学文章一直是指数增长。因此,即使对于领域专家,也难以跟踪域中的所有信息。随着商业搜索引擎的改进,用户可以在某些情况下键入其查询并获得最相关的一小组文档,以及在某些情况下从文档中的相关片段。但是,手动查找所需信息或答案可能仍然令人疑惑和耗时。这需要开发高效的QA系统,该系统旨在为用户提供精确和精确的答案提供了生物医学领域的自然语言问题。在本文中,我们介绍了用于开发普通域QA系统的基本方法,然后彻底调查生物医学QA系统的不同方面,包括使用结构化数据库和文本集合的基准数据集和几种提出的方​​法。我们还探讨了当前系统的局限性,并探索潜在的途径以获得进一步的进步。
translated by 谷歌翻译
There is an increasing interest in developing artificial intelligence (AI) systems to process and interpret electronic health records (EHRs). Natural language processing (NLP) powered by pretrained language models is the key technology for medical AI systems utilizing clinical narratives. However, there are few clinical language models, the largest of which trained in the clinical domain is comparatively small at 110 million parameters (compared with billions of parameters in the general domain). It is not clear how large clinical language models with billions of parameters can help medical AI systems utilize unstructured EHRs. In this study, we develop from scratch a large clinical language model - GatorTron - using >90 billion words of text (including >82 billion words of de-identified clinical text) and systematically evaluate it on 5 clinical NLP tasks including clinical concept extraction, medical relation extraction, semantic textual similarity, natural language inference (NLI), and medical question answering (MQA). We examine how (1) scaling up the number of parameters and (2) scaling up the size of the training data could benefit these NLP tasks. GatorTron models scale up the clinical language model from 110 million to 8.9 billion parameters and improve 5 clinical NLP tasks (e.g., 9.6% and 9.5% improvement in accuracy for NLI and MQA), which can be applied to medical AI systems to improve healthcare delivery. The GatorTron models are publicly available at: https://catalog.ngc.nvidia.com/orgs/nvidia/teams/clara/models/gatortron_og.
translated by 谷歌翻译
The internet has had a dramatic effect on the healthcare industry, allowing documents to be saved, shared, and managed digitally. This has made it easier to locate and share important data, improving patient care and providing more opportunities for medical studies. As there is so much data accessible to doctors and patients alike, summarizing it has become increasingly necessary - this has been supported through the introduction of deep learning and transformer-based networks, which have boosted the sector significantly in recent years. This paper gives a comprehensive survey of the current techniques and trends in medical summarization
translated by 谷歌翻译
Large language models (LLMs) have demonstrated impressive capabilities in natural language understanding and generation, but the quality bar for medical and clinical applications is high. Today, attempts to assess models' clinical knowledge typically rely on automated evaluations on limited benchmarks. There is no standard to evaluate model predictions and reasoning across a breadth of tasks. To address this, we present MultiMedQA, a benchmark combining six existing open question answering datasets spanning professional medical exams, research, and consumer queries; and HealthSearchQA, a new free-response dataset of medical questions searched online. We propose a framework for human evaluation of model answers along multiple axes including factuality, precision, possible harm, and bias. In addition, we evaluate PaLM (a 540-billion parameter LLM) and its instruction-tuned variant, Flan-PaLM, on MultiMedQA. Using a combination of prompting strategies, Flan-PaLM achieves state-of-the-art accuracy on every MultiMedQA multiple-choice dataset (MedQA, MedMCQA, PubMedQA, MMLU clinical topics), including 67.6% accuracy on MedQA (US Medical License Exam questions), surpassing prior state-of-the-art by over 17%. However, human evaluation reveals key gaps in Flan-PaLM responses. To resolve this we introduce instruction prompt tuning, a parameter-efficient approach for aligning LLMs to new domains using a few exemplars. The resulting model, Med-PaLM, performs encouragingly, but remains inferior to clinicians. We show that comprehension, recall of knowledge, and medical reasoning improve with model scale and instruction prompt tuning, suggesting the potential utility of LLMs in medicine. Our human evaluations reveal important limitations of today's models, reinforcing the importance of both evaluation frameworks and method development in creating safe, helpful LLM models for clinical applications.
translated by 谷歌翻译
自然语言处理(NLP)是一个人工智能领域,它应用信息技术来处理人类语言,在一定程度上理解并在各种应用中使用它。在过去的几年中,该领域已经迅速发展,现在采用了深层神经网络的现代变体来从大型文本语料库中提取相关模式。这项工作的主要目的是调查NLP在药理学领域的最新使用。正如我们的工作所表明的那样,NLP是药理学高度相关的信息提取和处理方法。它已被广泛使用,从智能搜索到成千上万的医疗文件到在社交媒体中找到对抗性药物相互作用的痕迹。我们将覆盖范围分为五个类别,以调查现代NLP方法论,常见的任务,相关的文本数据,知识库和有用的编程库。我们将这五个类别分为适当的子类别,描述其主要属性和想法,并以表格形式进行总结。最终的调查介绍了该领域的全面概述,对从业者和感兴趣的观察者有用。
translated by 谷歌翻译
诸如学术文章和商业报告之类的长期文件一直是详细说明重要问题和需要额外关注的复杂主题的标准格式。自动汇总系统可以有效地将长文档置于简短而简洁的文本中,以封装最重要的信息,从而在帮助读者的理解中很重要。最近,随着神经体系结构的出现,已经做出了重大的研究工作,以推动自动文本摘要系统,以及有关将这些系统扩展到长期文档领域的挑战的大量研究。在这项调查中,我们提供了有关长期文档摘要的研究的全面概述,以及其研究环境的三个主要组成部分的系统评估:基准数据集,汇总模型和评估指标。对于每个组成部分,我们在长期汇总的背景下组织文献,并进行经验分析,以扩大有关当前研究进度的观点。实证分析包括一项研究基准数据集的内在特征,摘要模型的多维分析以及摘要评估指标的综述。根据总体发现,我们通过提出可能在这个快速增长的领域中提出未来探索的方向来得出结论。
translated by 谷歌翻译
培训和评估语言模型越来越多地要求构建元数据 - 多样化的策划数据收集,并具有清晰的出处。自然语言提示最近通过将现有的,有监督的数据集转换为多种新颖的预处理任务,突出了元数据策划的好处,从而改善了零击的概括。尽管将这些以数据为中心的方法转化为生物医学语言建模的通用域文本成功,但由于标记的生物医学数据集在流行的数据中心中的代表性大大不足,因此仍然具有挑战性。为了应对这一挑战,我们介绍了BigBio一个由126个以上的生物医学NLP数据集的社区库,目前涵盖12个任务类别和10多种语言。 BigBio通过对数据集及其元数据进行程序化访问来促进可再现的元数据策划,并与当前的平台兼容,以及时工程和端到端的几个/零射击语言模型评估。我们讨论了我们的任务架构协调,数据审核,贡献指南的过程,并概述了两个说明性用例:生物医学提示和大规模,多任务学习的零射门评估。 BigBio是一项持续的社区努力,可在https://github.com/bigscience-workshop/biomedical上获得。
translated by 谷歌翻译
Natural Language Generation (NLG) has improved exponentially in recent years thanks to the development of sequence-to-sequence deep learning technologies such as Transformer-based language models. This advancement has led to more fluent and coherent NLG, leading to improved development in downstream tasks such as abstractive summarization, dialogue generation and data-to-text generation. However, it is also apparent that deep learning based generation is prone to hallucinate unintended text, which degrades the system performance and fails to meet user expectations in many real-world scenarios. To address this issue, many studies have been presented in measuring and mitigating hallucinated texts, but these have never been reviewed in a comprehensive manner before. In this survey, we thus provide a broad overview of the research progress and challenges in the hallucination problem in NLG. The survey is organized into two parts: (1) a general overview of metrics, mitigation methods, and future directions; and (2) an overview of task-specific research progress on hallucinations in the following downstream tasks, namely abstractive summarization, dialogue generation, generative question answering, data-to-text generation, machine translation, and visual-language generation. This survey serves to facilitate collaborative efforts among researchers in tackling the challenge of hallucinated texts in NLG.
translated by 谷歌翻译
Laws and their interpretations, legal arguments and agreements\ are typically expressed in writing, leading to the production of vast corpora of legal text. Their analysis, which is at the center of legal practice, becomes increasingly elaborate as these collections grow in size. Natural language understanding (NLU) technologies can be a valuable tool to support legal practitioners in these endeavors. Their usefulness, however, largely depends on whether current state-of-the-art models can generalize across various tasks in the legal domain. To answer this currently open question, we introduce the Legal General Language Understanding Evaluation (LexGLUE) benchmark, a collection of datasets for evaluating model performance across a diverse set of legal NLU tasks in a standardized way. We also provide an evaluation and analysis of several generic and legal-oriented models demonstrating that the latter consistently offer performance improvements across multiple tasks.
translated by 谷歌翻译
健康素养被出现为制定适当的健康决策和确保治疗结果的关键因素。然而,医学术语和该领域的专业语言的复杂结构使健康信息尤为难以解释。因此,迫切需要对自动化方法来提高生物医学文献的可访问性,以提高一般人群。这个问题可以作为医疗保健专业人员语言与公众的语言之间的翻译问题。在本文中,我们介绍了自动化生物医学科学评论的制定语言摘要的新任务,建设了一个数据集,以支持自动化方法的开发和评估,以提高生物医学文献的可访问性。我们对解决这项任务的各种挑战进行了分析,包括不仅对关键要点的总结,而且还概述了对背景知识和专业语言的简化的解释。我们试验最先进的摘要模型以及多种数据增强技术,并使用自动指标和人工评估评估其性能。结果表明,与专家专家专门开发的参考摘要相比,使用当代神经架构产生的自动产生的摘要可以实现有希望的质量和可读性(最佳Rouge-L为50.24和Flesch-Kincaid可读性得分为13.30)。我们还讨论了目前尝试的局限性,为未来工作提供了洞察和方向。
translated by 谷歌翻译
Pretrained transformer models have achieved state-of-the-art results in many tasks and benchmarks recently. Many state-of-the-art Language Models (LMs), however, do not scale well above the threshold of 512 input tokens. In specialized domains though (such as legal, scientific or biomedical), models often need to process very long text (sometimes well above 10000 tokens). Even though many efficient transformers have been proposed (such as Longformer, BigBird or FNet), so far, only very few such efficient models are available for specialized domains. Additionally, since the pretraining process is extremely costly in general - but even more so as the sequence length increases - it is often only in reach of large research labs. One way of making pretraining cheaper is the Replaced Token Detection (RTD) task, by providing more signal during training, since the loss can be computed over all tokens. In this work, we train Longformer models with the efficient RTD task on legal data to showcase that pretraining efficient LMs is possible using much less compute. We evaluate the trained models on challenging summarization tasks requiring the model to summarize long texts to show to what extent the models can achieve good performance on downstream tasks. We find that both the small and base models outperform their baselines on the in-domain BillSum and out-of-domain PubMed tasks in their respective parameter range. We publish our code and models for research purposes.
translated by 谷歌翻译
寻求健康信息的寻求使网络与消费者的健康相关问题淹没了。通常,消费者使用过度描述性和外围信息来表达其医疗状况或其他医疗保健需求,从而有助于自然语言理解的挑战。解决这一挑战的一种方法是总结问题并提取原始问题的关键信息。为了解决此问题,我们介绍了一个新的数据集CHQ-SUMM,其中包含1507个域 - 专家注释的消费者健康问题和相应的摘要。该数据集源自社区提问论坛,因此为了解社交媒体上与消费者健康相关的帖子提供了宝贵的资源。我们在多个最先进的摘要模型上基准测试数据集,以显示数据集的有效性。
translated by 谷歌翻译
Transfer learning, where a model is first pre-trained on a data-rich task before being finetuned on a downstream task, has emerged as a powerful technique in natural language processing (NLP). The effectiveness of transfer learning has given rise to a diversity of approaches, methodology, and practice. In this paper, we explore the landscape of transfer learning techniques for NLP by introducing a unified framework that converts all text-based language problems into a text-to-text format. Our systematic study compares pre-training objectives, architectures, unlabeled data sets, transfer approaches, and other factors on dozens of language understanding tasks. By combining the insights from our exploration with scale and our new "Colossal Clean Crawled Corpus", we achieve state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more. To facilitate future work on transfer learning for NLP, we release our data set, pre-trained models, and code.
translated by 谷歌翻译
由于临床实践所需的放射学报告和研究是在自由文本叙述中编写和存储的,因此很难提取相对信息进行进一步分析。在这种情况下,自然语言处理(NLP)技术可以促进自动信息提取和自由文本格式转换为结构化数据。近年来,基于深度学习(DL)的模型已适用于NLP实验,并具有令人鼓舞的结果。尽管基于人工神经网络(ANN)和卷积神经网络(CNN)的DL模型具有显着潜力,但这些模型仍面临临床实践中实施的一些局限性。变形金刚是另一种新的DL体系结构,已越来越多地用于改善流程。因此,在这项研究中,我们提出了一种基于变压器的细粒命名实体识别(NER)架构,以进行临床信息提取。我们以自由文本格式收集了88次腹部超声检查报告,并根据我们开发的信息架构进行了注释。文本到文本传输变压器模型(T5)和covive是T5模型的预训练域特异性适应性,用于微调来提取实体和关系,并将输入转换为结构化的格式。我们在这项研究中基于变压器的模型优于先前应用的方法,例如基于Rouge-1,Rouge-2,Rouge-L和BLEU分别为0.816、0.668、0.528和0.743的ANN和CNN模型,同时提供了一个分数可解释的结构化报告。
translated by 谷歌翻译
Recent lay language generation systems have used Transformer models trained on a parallel corpus to increase health information accessibility. However, the applicability of these models is constrained by the limited size and topical breadth of available corpora. We introduce CELLS, the largest (63k pairs) and broadest-ranging (12 journals) parallel corpus for lay language generation. The abstract and the corresponding lay language summary are written by domain experts, assuring the quality of our dataset. Furthermore, qualitative evaluation of expert-authored plain language summaries has revealed background explanation as a key strategy to increase accessibility. Such explanation is challenging for neural models to generate because it goes beyond simplification by adding content absent from the source. We derive two specialized paired corpora from CELLS to address key challenges in lay language generation: generating background explanations and simplifying the original abstract. We adopt retrieval-augmented models as an intuitive fit for the task of background explanation generation, and show improvements in summary quality and simplicity while maintaining factual correctness. Taken together, this work presents the first comprehensive study of background explanation for lay language generation, paving the path for disseminating scientific knowledge to a broader audience. CELLS is publicly available at: https://github.com/LinguisticAnomalies/pls_retrieval.
translated by 谷歌翻译
每年医生对患者的基于形象的诊断需求越来越大,是最近的人工智能方法可以解决的问题。在这种情况下,我们在医学图像的自动报告领域进行了调查,重点是使用深神经网络的方法,了解:(1)数据集,(2)架构设计,(3)解释性和(4)评估指标。我们的调查确定了有趣的发展,也是留下挑战。其中,目前对生成的报告的评估尤为薄弱,因为它主要依赖于传统的自然语言处理(NLP)指标,这不准确地捕获医疗正确性。
translated by 谷歌翻译
传达相关和忠实信息的能力对于有条件生成的许多任务至关重要,但对于神经SEQ-seq seq模型仍然难以捉摸,这些模型的输出通常显示出幻觉,并且无法正确涵盖重要细节。在这项工作中,我们主张规划作为有用的中间表示,以使有条件的一代减少不透明和扎根。我们的作品提出了将文本计划作为一系列提问(QA)对的新概念化。我们用QA蓝图作为内容选择(即〜说什么)和计划(即〜按什么顺序)来增强现有数据集(例如,用于摘要)。我们通过利用最先进的问题生成技术并将输入输出对自动获取蓝图,并将其转换为输入 - 蓝图输出输出元组。我们开发了基于变压器的模型,每个模型都在它们如何将蓝图合并到生成的输出中(例如,作为全局计划或迭代)。跨指标和数据集的评估表明,蓝图模型比不采取计划并允许对生成输出进行更严格控制的替代方案更为事实。
translated by 谷歌翻译
为了评估任何医疗干预的有效性,研究人员必须进行时间 - 密集和高度手动的文献综述。NLP系统可以帮助自动或协助实现这一昂贵的过程。为了支持这一目标,我们发布MS ^ 2(医学研究的多文件摘要),一个超过470K文档的数据集和来自科学文献的20k摘要。此数据集促进了可以在多项研究中评估和聚合矛盾证据的系统的开发,并且是生物医学领域的第一个大型公开可用的多文件摘要数据集。我们试验基于BART的摘要系统,具有前景的早期结果。我们以自由文本和结构形式制定我们的摘要输入和目标,并修改最近提出的指标,以评估我们系统生成的摘要的质量。数据和模型可在https://github.com/allenai/ms2上获得
translated by 谷歌翻译