世界各地的隐私法律和法规的景观是复杂而不断变化的。国家和超国家法律,协议,法令和其他政府发行的规则构成了公司必须遵循的拼凑而成才能在国际上进行运作。为了检查该拼凑而成的状态和演变,我们介绍了1,043条隐私法,法规和准则的政府隐私指示语料库或GPI语料库,涵盖了182个司法管辖区。该语料库可以对法律焦点进行大规模定量和定性检查。我们检查了创建GPI的时间分布,并说明了过去50年中隐私立法的急剧增加,尽管较细粒度的检查表明,增加的速度取决于GPIS所说的个人数据类型。我们的探索还表明,大多数隐私法分别解决了相对较少的个人数据类型,这表明全面的隐私立法仍然很少见。此外,主题建模结果显示了GPI中常见主题的普遍性,例如财务,医疗保健和电信。最后,我们将语料库释放到研究界,以促进进一步的研究。
translated by 谷歌翻译
大语言模型的兴起的一个关注点是它们可能造成重大伤害的潜力,尤其是在偏见,淫秽,版权和私人信息方面进行预处理。新兴的道德方法试图过滤预处理的材料,但是这种方法是临时的,未能考虑到上下文。我们提供了一种以法律为基础的过滤方法,该方法直接解决了过滤材料的权衡。首先,我们收集并提供了一堆法律,这是一个256GB(以及增长)的开源英语法律和行政数据数据集,涵盖法院意见,合同,行政规则和立法记录。对一堆法律进行预处理可能有助于解决有望改善司法接触的法律任务。其次,我们提炼政府已制定的法律规范将有毒或私人内容限制为可行的研究人员,并讨论我们的数据集如何反映这些规范。第三,我们展示了一堆法律如何为研究人员提供直接从数据中学习此类过滤规则的机会,从而为基于模型的处理提供了令人兴奋的新研究方向。
translated by 谷歌翻译
我们提出了一种新颖的基准和相关的评估指标,用于评估文本匿名方法的性能。文本匿名化定义为编辑文本文档以防止个人信息披露的任务,目前遭受了面向隐私的带注释的文本资源的短缺,因此难以正确评估各种匿名方法提供的隐私保护水平。本文介绍了标签(文本匿名基准),这是一种新的开源注释语料库,以解决此短缺。该语料库包括欧洲人权法院(ECHR)的1,268个英语法院案件,并充满了有关每个文档中出现的个人信息的全面注释,包括其语义类别,标识符类型,机密属性和共同参考关系。与以前的工作相比,TAB语料库旨在超越传统的识别(仅限于检测预定义的语义类别),并且明确标记了这些文本跨越的标记,这些文本应该被掩盖,以掩盖该人的身份受到保护。除了介绍语料库及其注释层外,我们还提出了一套评估指标,这些指标是针对衡量文本匿名性的性能而定制的,无论是在隐私保护和公用事业保护方面。我们通过评估几个基线文本匿名模型的经验性能来说明基准和提议的指标的使用。完整的语料库及其面向隐私的注释准则,评估脚本和基线模型可在以下网址提供:
translated by 谷歌翻译
创新是经济和社会发展的主要驱动力,有关多种创新的信息嵌入了专利和专利申请的半结构化数据中。尽管在专利数据中表达的创新的影响和新颖性很难通过传统手段来衡量,但ML提供了一套有希望的技术来评估新颖性,汇总贡献和嵌入语义。在本文中,我们介绍了Harvard USPTO专利数据集(HUPD),该数据集是2004年至2004年之间提交给美国专利商业办公室(USPTO)的大型,结构化和多用途的英语专利专利申请。 2018年。HUPD拥有超过450万张专利文件,是可比的Coldia的两到三倍。与以前在NLP中提出的专利数据集不同,HUPD包含了专利申请的发明人提交的版本(不是授予专利的最终版本),其中允许我们在第一次使用NLP方法进行申请时研究专利性。它在包含丰富的结构化元数据以及专利申请文本的同时也很新颖:通过提供每个应用程序的元数据及其所有文本字段,数据集使研究人员能够执行一组新的NLP任务,以利用结构性协变量的变异。作为有关HUPD的研究类型的案例研究,我们向NLP社区(即专利决策的二元分类)介绍了一项新任务。我们还显示数据集中提供的结构化元数据使我们能够对此任务进行概念转移的明确研究。最后,我们演示了如何将HUPD用于三个其他任务:专利主题领域的多类分类,语言建模和摘要。
translated by 谷歌翻译
随着大型语言模型的出现,抽象性摘要的方法取得了长足的进步,从而在应用程序中使用了帮助知识工人处理笨拙的文档收集的潜力。一个这样的环境是民权诉讼交换所(CRLC)(https://clearinghouse.net),其中发布了有关大规模民权诉讼,服务律师,学者和公众的信息。如今,CRLC中的摘要需要对律师和法律专业的学生进行广泛的培训,这些律师和法律专业的学生花费数小时了解多个相关文件,以便产生重要事件和结果的高质量摘要。在这种持续的现实世界摘要工作的激励下,我们引入了Multi-iplesum,这是由正在进行的CRLC写作中绘制的9,280个专家作者的摘要集。鉴于源文档的长度,多文章介绍了一个具有挑战性的多文档摘要任务,通常每个情况超过200页。此外,多胎sum与其多个目标摘要中的其他数据集不同,每个数据集都处于不同的粒度(从一句“极端”摘要到超过五百个单词的多段落叙述)。我们提供了广泛的分析,表明,尽管培训数据(遵守严格的内容和样式准则)中的摘要很高,但最新的摘要模型在此任务上的表现较差。我们发布了多体式的摘要方法,以及促进应用程序的开发,以协助CRLC的任务https://multilexsum.github.io。
translated by 谷歌翻译
在过去的十年中,许多组织制作了旨在从规范意义上进行标准化的文件,并为我们最近和快速的AI开发促进指导。但是,除了一些荟萃分析和该领域的批判性评论外,尚未分析这些文档中提出的思想的全部内容和分歧。在这项工作中,我们试图扩展过去研究人员所做的工作,并创建一种工具,以更好地数据可视化这些文档的内容和性质。我们还提供了通过将工具应用于200个文档的样本量获得的结果的批判性分析。
translated by 谷歌翻译
本文介绍了对土耳其语可用于的语料库和词汇资源的全面调查。我们审查了广泛的资源,重点关注公开可用的资源。除了提供有关可用语言资源的信息外,我们还提供了一组建议,并确定可用于在土耳其语言学和自然语言处理中进行研究和建筑应用的数据中的差距。
translated by 谷歌翻译
在学术界,抄袭肯定不是一个新兴的关注,但它随着互联网的普及和对全球内容来源的易于访问而变得更大的程度,使人类干预不足。尽管如此,由于计算机辅助抄袭检测,抄袭远远远非是一个未被解除的问题,目前是一个有效的研究领域,该研究落在信息检索(IR)和自然语言处理(NLP)领域。许多软件解决方案有助于满足这项任务,本文概述了用于阿拉伯语,法国和英语学术和教育环境的抄袭检测系统。比较在八个系统之间持有,并在检测不同来源的三个混淆水平的特征,可用性,技术方面以及它们的性能之间进行:逐字,释义和跨语言抄袭。在本研究的背景下也进行了对技术形式的抄袭技术形式的关注检查。此外,还提供了对不同作者提出的抄袭类型和分类的调查。
translated by 谷歌翻译
我们提出了多语言开放文本(MOT),这是一种新的多语言语料库,其中包含44种语言的文本,其中许多语言限制了现有的文本资源用于自然语言处理。该语料库的第一个版本包含超过280万篇新闻文章,并在2001 - 2022年之间发表了另外100万个短片段(照片标题,视频描述等),并从美国之声网站收集。我们描述了收集,过滤和处理数据的过程。原始材料在公共领域,我们的收藏品使用Creative Commons许可证(CC By 4.0)获得许可,并且用于创建该语料库的所有软件均在MIT许可证下发布。随着其他文档的发布,该语料库将定期更新。
translated by 谷歌翻译
本文确定了数据驱动系统中的数据最小化和目的限制的两个核心数据保护原理。虽然当代数据处理实践似乎与这些原则的赔率达到差异,但我们证明系统可以在技术上使用的数据远远少于目前的数据。此观察是我们详细的技术法律分析的起点,揭示了妨碍了妨碍了实现的障碍,并举例说明了在实践中应用数据保护法的意外权衡。我们的分析旨在向辩论提供关于数据保护对欧盟人工智能发展的影响,为数据控制员,监管机构和研究人员提供实际行动点。
translated by 谷歌翻译
The number of scientific publications continues to rise exponentially, especially in Computer Science (CS). However, current solutions to analyze those publications restrict access behind a paywall, offer no features for visual analysis, limit access to their data, only focus on niches or sub-fields, and/or are not flexible and modular enough to be transferred to other datasets. In this thesis, we conduct a scientometric analysis to uncover the implicit patterns hidden in CS metadata and to determine the state of CS research. Specifically, we investigate trends of the quantity, impact, and topics for authors, venues, document types (conferences vs. journals), and fields of study (compared to, e.g., medicine). To achieve this we introduce the CS-Insights system, an interactive web application to analyze CS publications with various dashboards, filters, and visualizations. The data underlying this system is the DBLP Discovery Dataset (D3), which contains metadata from 5 million CS publications. Both D3 and CS-Insights are open-access, and CS-Insights can be easily adapted to other datasets in the future. The most interesting findings of our scientometric analysis include that i) there has been a stark increase in publications, authors, and venues in the last two decades, ii) many authors only recently joined the field, iii) the most cited authors and venues focus on computer vision and pattern recognition, while the most productive prefer engineering-related topics, iv) the preference of researchers to publish in conferences over journals dwindles, v) on average, journal articles receive twice as many citations compared to conference papers, but the contrast is much smaller for the most cited conferences and journals, and vi) journals also get more citations in all other investigated fields of study, while only CS and engineering publish more in conferences than journals.
translated by 谷歌翻译
本文介绍了有关开发的原型的研究,以服务公共政策设计的定量研究。政治学的这种子学科着重于确定参与者,之间的关系以及在健康,环境,经济和其他政策方面可以使用的工具。我们的系统旨在自动化收集法律文件,用机构语法注释它们的过程,并使用超图来分析关键实体之间的相互关系。我们的系统经过了《联合国教科文组织公约》的保护,以保护2003年的无形文化遗产,这是一份法律文件,该文件规定了确保文化遗产的国际关系的基本方面。
translated by 谷歌翻译
Laws and their interpretations, legal arguments and agreements\ are typically expressed in writing, leading to the production of vast corpora of legal text. Their analysis, which is at the center of legal practice, becomes increasingly elaborate as these collections grow in size. Natural language understanding (NLU) technologies can be a valuable tool to support legal practitioners in these endeavors. Their usefulness, however, largely depends on whether current state-of-the-art models can generalize across various tasks in the legal domain. To answer this currently open question, we introduce the Legal General Language Understanding Evaluation (LexGLUE) benchmark, a collection of datasets for evaluating model performance across a diverse set of legal NLU tasks in a standardized way. We also provide an evaluation and analysis of several generic and legal-oriented models demonstrating that the latter consistently offer performance improvements across multiple tasks.
translated by 谷歌翻译
数据增强是自然语言处理(NLP)模型的鲁棒性评估的重要组成部分,以及增强他们培训的数据的多样性。在本文中,我们呈现NL-Cogmenter,这是一种新的参与式Python的自然语言增强框架,它支持创建两个转换(对数据的修改)和过滤器(根据特定功能的数据拆分)。我们描述了框架和初始的117个变换和23个过滤器,用于各种自然语言任务。我们通过使用其几个转换来分析流行自然语言模型的鲁棒性来证明NL-Upmenter的功效。基础架构,Datacards和稳健性分析结果在NL-Augmenter存储库上公开可用(\ url {https://github.com/gem-benchmark/nl-augmenter})。
translated by 谷歌翻译
While the NLP community is generally aware of resource disparities among languages, we lack research that quantifies the extent and types of such disparity. Prior surveys estimating the availability of resources based on the number of datasets can be misleading as dataset quality varies: many datasets are automatically induced or translated from English data. To provide a more comprehensive picture of language resources, we examine the characteristics of 156 publicly available NLP datasets. We manually annotate how they are created, including input text and label sources and tools used to build them, and what they study, tasks they address and motivations for their creation. After quantifying the qualitative NLP resource gap across languages, we discuss how to improve data collection in low-resource languages. We survey language-proficient NLP researchers and crowd workers per language, finding that their estimated availability correlates with dataset availability. Through crowdsourcing experiments, we identify strategies for collecting high-quality multilingual data on the Mechanical Turk platform. We conclude by making macro and micro-level suggestions to the NLP community and individual researchers for future multilingual data development.
translated by 谷歌翻译
两个关键假设塑造了排名检索的通常视图:(1)搜索者可以为他们希望看到的文档中的疑问选择单词,并且(2)排名检索的文档就足以,因为搜索者将足够就足够了能够认识到他们希望找到的那些。当要搜索的文档处于搜索者未知的语言时,既不是真的。在这种情况下,需要跨语言信息检索(CLIR)。本章审查了艺术技术的交流信息检索,并概述了一些开放的研究问题。
translated by 谷歌翻译
在科学研究中,该方法是解决科学问题和关键研究对象的必不可少手段。随着科学的发展,正在提出,修改和使用许多科学方法。作者在抽象和身体文本中描述了该方法的详细信息,并且反映该方法名称的学术文献中的关键实体称为方法实体。在大量的学术文献中探索各种方法实体有助于学者了解现有方法,为研究任务选择适当的方法并提出新方法。此外,方法实体的演变可以揭示纪律的发展并促进知识发现。因此,本文对方法论和经验作品进行了系统的综述,重点是从全文学术文献中提取方法实体,并努力使用这些提取的方法实体来建立知识服务。首先提出了本综述涉及的关键概念的定义。基于这些定义,我们系统地审查了提取和评估方法实体的方法和指标,重点是每种方法的利弊。我们还调查了如何使用提取的方法实体来构建新应用程序。最后,讨论了现有作品的限制以及潜在的下一步。
translated by 谷歌翻译
The optimal liability framework for AI systems remains an unsolved problem across the globe. In a much-anticipated move, the European Commission advanced two proposals outlining the European approach to AI liability in September 2022: a novel AI Liability Directive and a revision of the Product Liability Directive. They constitute the final, and much-anticipated, cornerstone of AI regulation in the EU. Crucially, the liability proposals and the EU AI Act are inherently intertwined: the latter does not contain any individual rights of affected persons, and the former lack specific, substantive rules on AI development and deployment. Taken together, these acts may well trigger a Brussels effect in AI regulation, with significant consequences for the US and other countries. This paper makes three novel contributions. First, it examines in detail the Commission proposals and shows that, while making steps in the right direction, they ultimately represent a half-hearted approach: if enacted as foreseen, AI liability in the EU will primarily rest on disclosure of evidence mechanisms and a set of narrowly defined presumptions concerning fault, defectiveness and causality. Hence, second, the article suggests amendments, which are collected in an Annex at the end of the paper. Third, based on an analysis of the key risks AI poses, the final part of the paper maps out a road for the future of AI liability and regulation, in the EU and beyond. This includes: a comprehensive framework for AI liability; provisions to support innovation; an extension to non-discrimination/algorithmic fairness, as well as explainable AI; and sustainability. I propose to jump-start sustainable AI regulation via sustainability impact assessments in the AI Act and sustainable design defects in the liability regime. In this way, the law may help spur not only fair AI and XAI, but potentially also sustainable AI (SAI).
translated by 谷歌翻译
当今的冲突变得越来越复杂,流畅和分散,通常涉及许多具有多重且经常发散利益的国家和国际参与者。随着调解员努力使冲突动态有理由,例如冲突政党的范围和政治立场的演变,相关与较少相关的参与者在和平建立和认同之间的区别或身份证明,这一发展构成了冲突调解的重大挑战。关键冲突问题及其相互依存。国际和平努力似乎不足以成功应对这些挑战。尽管技术已经在与冲突相关的领域进行了试验和使用,例如预测冲突或信息收集,但对技术如何促进冲突调解的关注较少。该案例研究有助于有关在冲突调解过程中使用最先进的机器学习技术和技术的新兴研究。本研究使用也门和平谈判中的对话成绩单,通过为他们提供知识管理,提取和冲突分析的工具来有效地支持中介团队。除了说明冲突调解中的机器学习工具的潜力外,本文还强调了跨学科和参与性的共同创造方法对开发上下文敏感和有针对性的工具的重要性,并确保有意义和负责任的实施。
translated by 谷歌翻译
We are currently unable to specify human goals and societal values in a way that reliably directs AI behavior. Law-making and legal interpretation form a computational engine that converts opaque human values into legible directives. "Law Informs Code" is the research agenda capturing complex computational legal processes, and embedding them in AI. Similar to how parties to a legal contract cannot foresee every potential contingency of their future relationship, and legislators cannot predict all the circumstances under which their proposed bills will be applied, we cannot ex ante specify rules that provably direct good AI behavior. Legal theory and practice have developed arrays of tools to address these specification problems. For instance, legal standards allow humans to develop shared understandings and adapt them to novel situations. In contrast to more prosaic uses of the law (e.g., as a deterrent of bad behavior through the threat of sanction), leveraged as an expression of how humans communicate their goals, and what society values, Law Informs Code. We describe how data generated by legal processes (methods of law-making, statutory interpretation, contract drafting, applications of legal standards, legal reasoning, etc.) can facilitate the robust specification of inherently vague human goals. This increases human-AI alignment and the local usefulness of AI. Toward society-AI alignment, we present a framework for understanding law as the applied philosophy of multi-agent alignment. Although law is partly a reflection of historically contingent political power - and thus not a perfect aggregation of citizen preferences - if properly parsed, its distillation offers the most legitimate computational comprehension of societal values available. If law eventually informs powerful AI, engaging in the deliberative political process to improve law takes on even more meaning.
translated by 谷歌翻译