Smart Paper Notes

Sharing Linkable Learning Objects with the use of Metadata and a Taxonomy Assistant for Categorization

Valentina Franzoni , Sergio Tasso , Simonetta Pallottelli , Damiano Perri

Category: Artificial Intelligence

2022-12-09

In this work, a re-design of the Moodledata module functionalities is presented to share learning objects between e-learning content platforms, e.g., Moodle and G-Lorep, in a linkable object format. The e-learning course content of the Drupal-based Content Management System G-Lorep for academic learning is exchanged by designing an object that incorporates metadata to support reuse and classification in its context. In such an Artificial Intelligence environment, the exchange of Linkable Learning Objects can be used for dialogue between Learning Systems to obtain information, especially with the use of semantic or structural similarity measures to enhance the existing Taxonomy Assistant for advanced automated classification.

Translated by Google Translate

LiveSchema: A Gateway Towards Learning on Knowledge Graph Schemas

Mattia Fumagalli , Marco Boffo , Daqian Shi , Mayukh Bagchi , Fausto Giunchiglia

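Category: Artificial Intelligence

2022-07-13

One of the main obstacles to training algorithms on knowledge graph schemas is the difficulty scientists face in finding the best input resource for a target prediction task. In addition, a key challenge is determining how to manipulate (and embed) these data, which usually come in the form of triples (i.e., subject, predicate, object), to enable the learning process. In this paper, we describe the LiveSchema initiative, a gateway that offers a family of services to easily access, analyze, transform, and exploit knowledge graph schemas, with the main goal of facilitating the reuse of these resources in machine learning use cases. As an early implementation of the initiative, we also present an online catalog, which relies on more than 800 resources, together with a first set of example services.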
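
The remark about manipulating and embedding triples can be made concrete with a small sketch. It is not taken from LiveSchema: the toy schema, `build_vocab`, and `encode` are hypothetical names used only to show the usual first step before embedding, namely mapping each (subject, predicate, object) triple to integer ids.

```python
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)


def build_vocab(triples: List[Triple]) -> Tuple[Dict[str, int], Dict[str, int]]:
    """Assign an integer id to every entity and predicate, in order of first appearance."""
    entities: Dict[str, int] = {}
    predicates: Dict[str, int] = {}
    for subj, pred, obj in triples:
        for node in (subj, obj):
            entities.setdefault(node, len(entities))
        predicates.setdefault(pred, len(predicates))
    return entities, predicates


def encode(triples: List[Triple],
           entities: Dict[str, int],
           predicates: Dict[str, int]) -> List[Tuple[int, int, int]]:
    """Turn string triples into id triples, the input most embedding models expect."""
    return [(entities[s], predicates[p], entities[o]) for s, p, o in triples]


# Hypothetical toy schema, for illustration only.
triples = [
    ("Person", "subClassOf", "Agent"),
    ("Organization", "subClassOf", "Agent"),
    ("Person", "memberOf", "Organization"),
]
entities, predicates = build_vocab(triples)
encoded = encode(triples, entities, predicates)
print(encoded)
```

The id triples can then be fed to any knowledge graph embedding model; which model fits a given schema is exactly the kind of question the abstract says is hard to answer.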


A Survey of Plagiarism Detection Systems: Case of Use with English, French and Arabic Languages

Mehdi Abdelhamid , Faical Azouaou , Sofiane Batata

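Category: Natural Language Processing

2022-01-10

In academia, plagiarism is certainly not an emerging concern, but it has grown in magnitude with the popularity of the Internet and the ease of access to worldwide content sources, making human intervention insufficient. Nevertheless, plagiarism is far from being an unaddressed problem: computer-assisted plagiarism detection is currently an active research area that falls within the fields of Information Retrieval (IR) and Natural Language Processing (NLP). Many software solutions have emerged to help with this task, and this paper presents an overview of plagiarism detection systems for use in Arabic, French, and English academic and educational settings. Eight systems are compared with respect to their features, usability, technical aspects, and their performance in detecting three levels of obfuscation from different sources: verbatim, paraphrase, and cross-language plagiarism. A close examination of technical forms of plagiarism was also carried out in the context of this study. In addition, a survey of plagiarism typologies and classifications proposed by different authors is provided.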


Proceedings of the 2nd International Workshop on Reading Music Systems

Jorge Calvo-Zaragoza , Alexander Pacha

Category: Computer Vision | Machine Learning

2022-12-01

The International Workshop on Reading Music Systems (WoRMS) is a workshop that tries to connect researchers who develop systems for reading music, such as in the field of Optical Music Recognition, with other researchers and practitioners that could benefit from such systems, like librarians or musicologists. The relevant topics of interest for the workshop include, but are not limited to: Music reading systems; Optical music recognition; Datasets and performance evaluation; Image processing on music scores; Writer identification; Authoring, editing, storing and presentation systems for music scores; Multi-modal systems; Novel input-methods for music to produce written music; Web-based Music Information Retrieval services; Applications and projects; Use-cases related to written music. These are the proceedings of the 2nd International Workshop on Reading Music Systems, held in Delft on the 2nd of November 2019.


Dbpedia: A nucleus for a web of open data

Category:

DBpedia is a community effort to extract structured information from Wikipedia and to make this information available on the Web. DBpedia allows you to ask sophisticated queries against datasets derived from Wikipedia and to link other datasets on the Web to Wikipedia data. We describe the extraction of the DBpedia datasets, and how the resulting information is published on the Web for human and machine consumption. We describe some emerging applications from the DBpedia community and show how website authors can facilitate DBpedia content within their sites. Finally, we present the current status of interlinking DBpedia with other open datasets on the Web and outline how DBpedia could serve as a nucleus for an emerging Web of open data.


Analyzing the State of Computer Science Research with the DBLP Discovery Dataset

Lennart Küll

Category: Natural Language Processing

2022-12-01

The number of scientific publications continues to rise exponentially, especially in Computer Science (CS). However, current solutions to analyze those publications restrict access behind a paywall, offer no features for visual analysis, limit access to their data, only focus on niches or sub-fields, and/or are not flexible and modular enough to be transferred to other datasets. In this thesis, we conduct a scientometric analysis to uncover the implicit patterns hidden in CS metadata and to determine the state of CS research. Specifically, we investigate trends of the quantity, impact, and topics for authors, venues, document types (conferences vs. journals), and fields of study (compared to, e.g., medicine). To achieve this we introduce the CS-Insights system, an interactive web application to analyze CS publications with various dashboards, filters, and visualizations. The data underlying this system is the DBLP Discovery Dataset (D3), which contains metadata from 5 million CS publications. Both D3 and CS-Insights are open-access, and CS-Insights can be easily adapted to other datasets in the future. The most interesting findings of our scientometric analysis include that i) there has been a stark increase in publications, authors, and venues in the last two decades, ii) many authors only recently joined the field, iii) the most cited authors and venues focus on computer vision and pattern recognition, while the most productive prefer engineering-related topics, iv) the preference of researchers to publish in conferences over journals dwindles, v) on average, journal articles receive twice as many citations compared to conference papers, but the contrast is much smaller for the most cited conferences and journals, and vi) journals also get more citations in all other investigated fields of study, while only CS and engineering publish more in conferences than journals.


Graph Learning Indexer: A Contributor-Friendly and Metadata-Rich Platform for Graph Learning Benchmarks

Jiaqi Ma , Xingjian Zhang , Hezheng Fan , Jin Huang , Tianyue Li , Ting Wei Li , Yiwen Tu , Chenshu Zhu , Qiaozhu Mei

Category: Machine Learning

2022-12-08

Establishing open and general benchmarks has been a critical driving force behind the success of modern machine learning techniques. As machine learning is being applied to broader domains and tasks, there is a need to establish richer and more diverse benchmarks to better reflect the reality of the application scenarios. Graph learning is an emerging field of machine learning that urgently needs more and better benchmarks. To accommodate the need, we introduce Graph Learning Indexer (GLI), a benchmark curation platform for graph learning. In comparison to existing graph learning benchmark libraries, GLI highlights two novel design objectives. First, GLI is designed to incentivize dataset contributors. In particular, we incorporate various measures to minimize the effort of contributing and maintaining a dataset, increase the usability of the contributed dataset, as well as encourage attributions to different contributors of the dataset. Second, GLI is designed to curate a knowledge base, instead of a plain collection, of benchmark datasets. We use multiple sources of meta information to augment the benchmark datasets with rich characteristics, so that they can be easily selected and used in downstream research or development. The source code of GLI is available at https://github.com/Graph-Learning-Benchmarks/gli.


Interactive Question Answering Systems: Literature Review

Giovanni Maria Biancofiore , Yashar Deldjoo , Tommaso Di Noia , Eugenio Di Sciascio , Fedelucio Narducci

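Category: Natural Language Processing | Artificial Intelligence

2022-09-04

Question answering systems are recognized as a popular and frequently effective means of seeking information on the web. In such systems, information seekers can receive a concise response to their query by posing their questions in natural language. Interactive question answering is a recently proposed and increasingly popular solution that sits at the intersection of question answering and dialogue systems. On the one hand, the user can ask questions in ordinary language and find the actual response to her inquiry; on the other hand, the system can extend the question-answering session into a dialogue if the initial request has multiple possible replies, very few replies, or ambiguities. By allowing the user to ask more questions, interactive question answering enables users to interact dynamically with the system and obtain more precise results. This survey offers a detailed overview of the interactive question-answering methods that are prevalent in the current literature. It begins by explaining the foundational principles of question-answering systems, thereby defining new notations and taxonomies to combine all identified works within a unified framework. The reviewed published work on interactive question-answering systems is then presented and examined in terms of proposed methodology, evaluation approaches, and dataset/application domain. We also describe trends surrounding specific tasks and issues raised by the community, so as to shed light on the future interests of scholars. Our work is further supported by a GitHub page that synthesizes all the major topics covered in this literature study. https://sisinflab.github.io/interactive-question-answering-systems-survey/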


Large-Scale Data Mining of Rapid Residue Detection Assay Data From HTML and PDF Documents: Improving Data Access and Visualization for Veterinarians

Majid Jaberi-Douraki , Soudabeh Taghian Dinani , Nuwan Indika Millagaha Gedara , Xuan Xu , Emily Richards , Fiona Maunsell , Nader Zad , Lisa Ann Tell

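Category: Machine Learning

2021-12-02

Extra-label drug use in food animal medicine is authorized by the US Animal Medicinal Drug Use Clarification Act (AMDUCA), and estimated withdrawal intervals are based on published scientific pharmacokinetic data. Occasionally there is a lack of scientific data on which to base a withdrawal interval, or a large number of animals are being treated, which drives the need to test for drug residues. Rapid assay commercial farm-side tests are essential for monitoring drug residues in animal products to protect human health. Active ingredients, sensitivity, matrices, and species for commercial rapid assay tests have been reported on manufacturers' websites or in PDF documents available to consumers, but access may require a special request. Additionally, this information is not always correlated with FDA-approved tolerances. Furthermore, parameter changes for these tests can be very challenging to identify on a regular basis, especially for those listed on websites or in documents that are not publicly available. Therefore, artificial intelligence plays a critical role in efficiently extracting the data and ensuring that the information is current. Extracting tables from PDF and HTML documents has been investigated by both academia and commercial tool builders. Research on text mining of such documents has become a widespread yet challenging arena for implementing natural language programming. However, techniques for extracting tables are still in their infancy and are being investigated and improved by researchers. In this study, we developed and evaluated a data-mining method for automatically extracting rapid assay data from electronic documents. Our automatic electronic data extraction method includes a software package module, a developed pattern recognition tool, and a data mining engine. Assay details were provided by several commercial entities that produce these rapid drug residue assay tests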
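
To give a feel for what extracting tables from HTML involves, here is a minimal sketch using only Python's standard-library `html.parser`. It is not the authors' pipeline: the `TableExtractor` class and the sample table (with placeholder values) are assumptions for illustration, and the sketch ignores nested tables, `rowspan`/`colspan`, and PDF input.

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect every <table> as a list of rows, each row a list of cell strings."""

    def __init__(self):
        super().__init__()
        self.tables = []    # all tables found so far
        self._row = None    # cells of the row being read, or None
        self._cell = None   # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None


# Hypothetical page fragment; the values are placeholders, not real assay data.
html = """
<table>
  <tr><th>Test</th><th>Analyte</th><th>Matrix</th></tr>
  <tr><td>Assay A</td><td>Drug X</td><td>milk</td></tr>
</table>
"""
extractor = TableExtractor()
extractor.feed(html)
print(extractor.tables)
```

Real manufacturer pages tend to need additional handling (merged cells, tables used for layout, values split across tags), which is part of why the abstract describes table extraction as still immature.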


Analyzing social media with crowdsourcing in Crowd4SDG

Carlo Bono , Mehmet Oğuz Mülâyim , Cinzia Cappiello , Mark Carman , Jesus Cerquides , Jose Luis Fernandez-Marquez , Rosy Mondardini , Edoardo Ramalli , Barbara Pernici

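Category: Artificial Intelligence

2022-08-04

Social media have the potential to provide timely information about emergency situations and sudden events. However, finding relevant information among the millions of posts published every day can be difficult, and developing a data analysis project usually requires time and technical skills. This study presents an approach that provides flexible support for analyzing social media, particularly during emergencies. Different use cases in which social media analysis can be adopted are introduced, and the challenges of retrieving information from large sets of posts are discussed. The focus is on analyzing the images and text contained in social media posts, and on a set of automatic data processing tools for filtering and classifying content, combined with a human-in-the-loop approach to support the data analyst. Such support includes feedback and suggestions for configuring the automated tools, as well as crowdsourcing to gather input from citizens. The results are validated by discussing three case studies developed within the Crowd4SDG H2020 European project.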


Entity Graph Extraction from Legal Acts -- a Prototype for a Use Case in Policy Design Analysis

Anna Wróblewska , Bartosz Pieliński , Karolina Seweryn , Karol Saputa , Aleksandra Wichrowska , Sylwia Sysko-Romańczuk , Hanna Schreiber

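Category: Natural Language Processing

2022-09-02

This paper presents research on a prototype developed to serve the quantitative study of public policy design. This sub-discipline of political science focuses on identifying actors, the relations between them, and the tools at their disposal in health, environmental, economic, and other policies. Our system aims to automate the process of gathering legal documents, annotating them with Institutional Grammar, and using hypergraphs to analyze the inter-relations between crucial entities. Our system is tested against the UNESCO Convention for the Safeguarding of the Intangible Cultural Heritage from 2003, a legal document that regulates essential aspects of international relations for securing cultural heritage.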
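
The idea of using hypergraphs to relate entities can be sketched briefly. This is an illustration under assumptions, not the prototype's code: each annotated statement is treated as one hyperedge connecting all entities it mentions, and the statement ids and entity names below are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations


def incidence(hyperedges):
    """Map each entity to the set of hyperedges (statements) it takes part in."""
    by_entity = defaultdict(set)
    for edge_id, members in hyperedges.items():
        for entity in members:
            by_entity[entity].add(edge_id)
    return dict(by_entity)


def co_occurrence(hyperedges):
    """Count how many hyperedges each pair of entities shares."""
    counts = Counter()
    for members in hyperedges.values():
        # Sorting makes each pair appear in one canonical order.
        for pair in combinations(sorted(members), 2):
            counts[pair] += 1
    return counts


# Hypothetical annotated statements; one hyperedge per statement.
hyperedges = {
    "stmt_1": {"State Party", "Committee", "inventory"},
    "stmt_2": {"Committee", "General Assembly"},
    "stmt_3": {"State Party", "Committee"},
}
entity_edges = incidence(hyperedges)
pairs = co_occurrence(hyperedges)
print(pairs.most_common(1))
```

Unlike an ordinary graph, a hyperedge keeps the fact that three or more entities appear in the same statement, which pairwise edges alone would lose.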


YMIR: A Rapid Data-centric Development Platform for Vision Applications

Phoenix X. Huang , Wenze Hu , William Brendel , Manmohan Chandraker , Li-Jia Li , Xiaoyu Wang

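Category: Artificial Intelligence | Machine Learning

2021-11-19

This paper introduces an open source platform for the rapid development of computer vision applications. The platform puts efficient data development at the center of the machine learning development process, integrates active learning methods and data and model version control, and uses concepts such as projects to enable fast iteration on multiple task-specific datasets in parallel. We make it an open platform by abstracting the development process into core states and operations, and we design open APIs to integrate third-party tools as implementations of the operations. This open design reduces the development cost and adoption cost for ML teams with existing tools. At the same time, the platform supports recording project development history, through which successful projects can be shared to further boost model production efficiency on similar tasks. The platform is open source and is already used internally to meet the growing demand for custom real-world computer vision applications.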


PhishMatch: A Layered Approach for Effective Detection of Phishing URLs

Harshal Tupsamudre , Sparsh Jain , Sachin Lodha

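Category: Machine Learning

2021-12-04

Phishing attacks continue to be a significant threat on the Internet. Prior studies show that it is possible to determine whether a website is phishing just by analyzing its URL more carefully. A major advantage of the URL-based approach is that it can identify a phishing website even before the web page is rendered in the browser, thus avoiding other potential problems such as cryptojacking and drive-by downloads. However, traditional URL-based approaches have their limitations. Blacklist-based approaches are prone to zero-hour phishing attacks, approaches based on advanced machine learning consume high resources, and other approaches send the URL to a remote server, which compromises the user's privacy. In this paper, we present a layered anti-phishing defense, PhishMatch, which is robust, accurate, inexpensive, and client-side. We design a space-time efficient Aho-Corasick algorithm for exact string matching and an n-gram based indexing technique for approximate string matching to detect various cybersquatting techniques in the phishing URL. To reduce false positives, we use a global whitelist and a personalized user whitelist. We also determine the context in which the URL is visited and use that information to classify the input URL more accurately. The last component of PhishMatch involves a machine learning model and controlled search engine queries to classify the URL. A prototype plugin of PhishMatch, developed for the Chrome browser, was found to be fast and lightweight. Our evaluation shows that PhishMatch is both efficient and effective.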
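
The n-gram indexing idea for approximate string matching can be sketched as follows. This is a generic illustration, not PhishMatch's actual data structure: the `NGramIndex` class, the Jaccard scoring, the `0.4` threshold, and the list of protected names are all assumptions made for the example.

```python
from collections import defaultdict


def char_ngrams(text, n=3):
    """Return the set of character n-grams of text, padded with start/end markers."""
    padded = f"^{text}$"
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


class NGramIndex:
    """Inverted index from character n-grams to the protected names that contain them."""

    def __init__(self, names, n=3):
        self.n = n
        self.grams = {name: char_ngrams(name, n) for name in names}
        self.index = defaultdict(set)
        for name, grams in self.grams.items():
            for gram in grams:
                self.index[gram].add(name)

    def lookup(self, query, threshold=0.4):
        """Return (name, Jaccard similarity) pairs for near matches, best first."""
        query_grams = char_ngrams(query, self.n)
        # Only names sharing at least one n-gram with the query are scored.
        candidates = set()
        for gram in query_grams:
            candidates |= self.index.get(gram, set())
        scored = []
        for name in candidates:
            if name == query:
                continue  # an exact match is the legitimate name, not a look-alike
            grams = self.grams[name]
            similarity = len(query_grams & grams) / len(query_grams | grams)
            if similarity >= threshold:
                scored.append((name, similarity))
        return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical list of protected brand names.
index = NGramIndex(["paypal", "google", "amazon"])
print(index.lookup("paypa1"))   # look-alike of a protected name
print(index.lookup("example"))  # unrelated label
```

The inverted index keeps lookups cheap because only names that share an n-gram with the query are scored, which matters for a client-side tool.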


Learning to Rank with Small Set of Ground Truth Data

Jiashu Wu

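Category: Artificial Intelligence

2022-07-04

Over the past decades, researchers have devoted considerable effort to investigating techniques for ranking the query results retrieved during information retrieval, or for ranking recommended products in recommender systems. In this project, we aim to investigate searching, ranking, and recommendation techniques to help realize a university academia search platform. Unlike the usual information retrieval scenarios, where plenty of ground-truth ranking data is available, in our case we have only limited ground-truth knowledge about the academia ranking. For instance, for some search queries we only know a few researchers who are highly relevant and thus should be ranked at the top, and for other search queries we do not know at all which researchers should be ranked at the top. The limited amount of ground-truth data makes some conventional ranking techniques and evaluation metrics infeasible, which is a major challenge we face in this project. This project can greatly enhance the user's academia search experience and help realize an academia search platform that covers researchers, publications, and fields of study, which benefits not only university faculties but also students' research experience.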


Logic Mill -- A Knowledge Navigation System

Sebastian Erhardt , Mainak Ghosh , Erik Buunk , Michael E. Rose , Dietmar Harhoff

Category: Natural Language Processing

2022-12-31

Logic Mill is a scalable and openly accessible software system that identifies semantically similar documents within either one domain-specific corpus or multi-domain corpora. It uses advanced Natural Language Processing (NLP) techniques to generate numerical representations of documents. Currently it leverages a large pre-trained language model to generate these document representations. The system focuses on scientific publications and patent documents and contains more than 200 million documents. It is easily accessible via a simple Application Programming Interface (API) or via a web interface. Moreover, it is continuously being updated and can be extended to text corpora from other domains. We see this system as a general-purpose tool for future research applications in the social sciences and other domains.
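
Once documents have numerical representations, finding semantically similar ones usually reduces to a nearest-neighbor search over vectors. The sketch below uses cosine similarity on tiny hand-made vectors; it does not use the Logic Mill API, and the document ids, the vectors, and the `most_similar` helper are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(query_id, vectors, k=2):
    """Rank all other documents by cosine similarity to the query document."""
    query = vectors[query_id]
    scored = [
        (doc_id, cosine(query, vec))
        for doc_id, vec in vectors.items()
        if doc_id != query_id
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:k]


# Hypothetical 3-dimensional document vectors; real embeddings have hundreds of dimensions.
vectors = {
    "paper_a": [1.0, 0.0, 0.0],
    "patent_b": [0.9, 0.1, 0.0],
    "paper_c": [0.0, 1.0, 0.0],
}
print(most_similar("paper_a", vectors))
```

At the scale the abstract mentions (more than 200 million documents), an exhaustive scan like this would be replaced by an approximate nearest-neighbor index, but the similarity measure is the same idea.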


JEMMA: An Extensible Java Dataset for ML4Code Applications

Anjan Karmakar , Miltiadis Allamanis , Romain Robbes

Category: Machine Learning

2022-12-18

Machine Learning for Source Code (ML4Code) is an active research field in which extensive experimentation is needed to discover how to best use source code's richly structured information. With this in mind, we introduce JEMMA, an Extensible Java Dataset for ML4Code Applications, which is a large-scale, diverse, and high-quality dataset targeted at ML4Code. Our goal with JEMMA is to lower the barrier to entry in ML4Code by providing the building blocks to experiment with source code models and tasks. JEMMA comes with a considerable amount of pre-processed information such as metadata, representations (e.g., code tokens, ASTs, graphs), and several properties (e.g., metrics, static analysis results) for 50,000 Java projects from the 50KC dataset, with over 1.2 million classes and over 8 million methods. JEMMA is also extensible allowing users to add new properties and representations to the dataset, and evaluate tasks on them. Thus, JEMMA becomes a workbench that researchers can use to experiment with novel representations and tasks operating on source code. To demonstrate the utility of the dataset, we also report results from two empirical studies on our data, ultimately showing that significant work lies ahead in the design of context-aware source code models that can reason over a broader network of source code entities in a software project, the very task that JEMMA is designed to help with.


Automatic generation of semantic corpora for improving intent estimation of taxonomy-driven search engines

Lorenzo Massai

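Category: Natural Language Processing | Artificial Intelligence

2022-03-30

With the increasing demand for intelligent systems capable of operating in different user contexts (e.g., users on the move), the correct interpretation of the user's need by such systems has become crucial for giving consistent answers to user queries. The most effective techniques for addressing this task come from the fields of natural language processing and semantic expansion of terms. Such systems aim to estimate the actual meaning of input queries by resolving the concepts behind the words expressed in the user's question. The aim of this paper is to show which semantic relations have the greatest impact in retrieval systems based on semantic expansion, and to identify the best trade-off between accuracy and noise introduction when combining such relations. The evaluation is carried out by building a simple natural language processing system capable of querying any taxonomy-driven domain, using a combination of different semantic expansions as knowledge resources. The proposed evaluation employs a wide and varied taxonomy as a use case, exploiting its labels as the basis for the expansions. To build the knowledge resources, several corpora have been produced and integrated into the NLP infrastructure, with the purpose of estimating the pseudo-queries corresponding to the taxonomy labels, which are considered as the possible intents.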


The Development and Applications of Food Knowledge Graphs in the Food Science and Industry

Weiqing Min , Chunlin Liu , Leyi Xu , Shuqiang Jiang

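Category: Computer Vision

2021-07-13

The deployment of various networks (e.g., the Internet of Things (IoT) and mobile networks), databases (e.g., nutrition tables and food composition databases), and social media (e.g., Instagram and Twitter) generates huge amounts of multi-type food data, which plays a critical role in food science and industry. However, due to well-known data harmonization problems, these multi-source food data appear as information silos, making it difficult to exploit them fully. Food knowledge graphs provide unified and standardized conceptual terminology and its relationships in a structured form, and can thus transform these food information silos into a more reusable, globally and digitally connected Internet of Food that benefits various applications. To the best of our knowledge, this is the first comprehensive review of food knowledge graphs in food science and industry. We first provide a brief introduction to knowledge graphs and then trace the progress from food classification and food ontologies to food knowledge graphs. Representative applications of food knowledge graphs are then summarized in terms of new recipe development, food traceability, food data visualization, personalized dietary recommendation, food search and question answering, visual food object recognition, and intelligent manufacturing of food machinery. We also discuss future directions in this field, such as food knowledge graphs for food supply chain systems and human health, which deserve further study. Their great potential will attract more research efforts to apply food knowledge graphs in food science and industry.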


Globus Automation Services: Research process automation across the space-time continuum

Ryan Chard , Jim Pruyne , Kurt McKee , Josh Bryan , Brigitte Raumann , Rachana Ananthakrishnan , Kyle Chard , Ian Foster

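Category: Artificial Intelligence

2022-08-19

Research process automation, meaning the reliable, efficient, and reproducible execution of actions on scientific instruments, computers, data stores, and other resources, is an essential element of modern science. We report here on new services within the Globus research data management platform that enable the specification of diverse research processes as reusable sets of actions, called flows, and the execution of such flows in heterogeneous research environments. To support flows with broad spatial extent (e.g., from a scientific instrument to a remote data center) and temporal extent (from seconds to weeks), these Globus automation services feature: 1) cloud hosting for reliable execution of even long-lived flows despite sporadic failures; 2) a declarative notation and an extensible asynchronous action provider API for defining and executing a wide variety of actions and flow specifications involving arbitrary resources; and 3) authorization delegation mechanisms for secure invocation of actions. These services permit researchers to outsource and automate the management of a broad range of research tasks to a reliable, scalable, and secure cloud platform. We present use cases for Globus automation services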


Deep Learning Driven Natural Languages Text to SQL Query Conversion: A Survey

Ayush Kumar , Parth Nagarkar , Prabhav Nalhe , Sanjeev Vijayakumar

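Category: Natural Language Processing | Artificial Intelligence

2022-08-08

As the future moves toward data-centric decision-making, seamless access to databases is of utmost importance. There is extensive research on creating efficient text-to-SQL (Text2SQL) models to access data from databases. Natural language is one of the best interfaces for bridging the gap between data and results by accessing databases efficiently, especially for non-technical users. It will open the door to, and create great interest among, users who are less skilled in technology or less proficient in query languages. Even though many deep-learning-based algorithms have been proposed or studied, using natural language to solve data query problems in real-world working scenarios remains very challenging. The reason is that different studies use different datasets, which bring their own limitations and assumptions. At the same time, we lack a thorough understanding of these proposed models and of their limitations with respect to the specific datasets they are trained on. In this paper, we try to present a holistic overview of 24 neural network models studied over the past few years, including their architectures involving convolutional neural networks, recurrent neural networks, pointer networks, reinforcement learning, generative models, and so on. We also give an overview of 11 datasets that are widely used to train models for Text2SQL technologies. Finally, we discuss the future application possibilities of Text2SQL technologies for seamless data queries.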
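
To make the Text2SQL task concrete, the sketch below pairs a natural-language question with a SQL query and runs it against an in-memory SQLite database. No model is involved: `predicted_sql` is written by hand and merely stands in for what a Text2SQL model would be expected to generate, and the `employees` table and its rows are hypothetical.

```python
import sqlite3

# Hypothetical toy database, held in memory.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, department TEXT, salary REAL)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [
        ("Ana", "Sales", 50000),
        ("Ben", "Sales", 70000),
        ("Cy", "R&D", 90000),
    ],
)

# Input a non-technical user might type.
question = "What is the average salary in each department?"

# Hand-written stand-in for the output of a Text2SQL model.
predicted_sql = (
    "SELECT department, AVG(salary) "
    "FROM employees "
    "GROUP BY department "
    "ORDER BY department"
)

rows = conn.execute(predicted_sql).fetchall()
print(question)
print(rows)
conn.close()
```

The model has to recover table and column names, the aggregation, and the grouping from a question that mentions none of them in SQL terms, which is part of why dataset-specific assumptions matter so much.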
