Smart Paper Notes

Bridging the gap to real-world for network intrusion detection systems with data-centric approach

Gustavo de Carvalho Bertoli , Lourenço Alves Pereira Junior , Filipe Alves Neto Verri , Aldri Luiz dos Santos , Osamu Saotome

Categories: Artificial Intelligence | Machine Learning

2021-10-25

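Most research on machine learning (ML) for network intrusion detection systems (NIDS) uses well-established datasets such as KDD-CUP99, NSL-KDD, UNSW-NB15, and CICIDS-2017. In this setting, the potential of machine learning techniques is explored with the aim of improving metrics over published baselines (a model-centric approach). However, these datasets have limitations, such as aging, that make it infeasible to carry ML-based solutions over to real-world applications. This paper presents a systematic data-centric approach to address the current limitations of NIDS research, specifically the datasets. The approach generates NIDS datasets composed of recent network traffic and attacks, with the labeling process built in by design.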


Intuitive Robot Programming by Capturing Human Manufacturing Skills: A Framework for the Process of Glass Adhesive Application

Mihail Babcinschi , Francisco Cruz , Nicole Duarte , Silvia Santos , Samuel Alves , Pedro Neto

Category: Robotics

2022-09-15

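There is great demand for the robotization of manufacturing processes that involve monotonous labor. Some manufacturing tasks that require specific skills (welding, painting, etc.) suffer from a shortage of workers. Robots have been used in these tasks, but their flexibility is limited because they remain difficult for non-experts to program and reprogram, which puts them out of reach for most companies. Robot offline programming (OLP) is reliable. However, paths generated directly from CAD/CAM do not include relevant parameters that represent human skills, such as the orientation and velocity of the robot end-effector. This paper presents an intuitive robot programming system that captures human manufacturing skills and transforms them into robot programs. Demonstrations by skilled human workers are recorded using a magnetic tracking system attached to the work tool. The collected data include the orientation and velocity along the working path. Positional data are extracted from CAD/CAM, because the error in positions captured by the magnetic tracker is significant. Path poses are transformed in Cartesian space and validated in a simulation environment. Robot programs are then generated and transferred to the real robot. Experiments on a glass adhesive application process demonstrate that the proposed framework is intuitive to use and effective at capturing human skills and transferring them to the robot.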


Trajectory Planning for Hybrid Unmanned Aerial Underwater Vehicles with Smooth Media Transition

Pedro Miranda Pinheiro , Armando Alves Neto , Ricardo Bedin Grando , Cesar Bastos da Silva , Vivian Misaki Aoki , Dayana Cardoso , Alexandre Campos Horn , Paulo Lilles Jorge Drews-Jr

Category: Robotics

2021-12-27

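Over the last decade, great effort has gone into the study of hybrid unmanned aerial underwater vehicles, robots that can easily fly and dive into water with varying levels of mechanical adaptation. However, most of this literature concentrates on physical design, practical construction issues, and, more recently, low-level control strategies. Little has been done in the context of high-level intelligence, such as motion planning and interaction with the real world. We therefore propose in this paper a trajectory planning approach that allows collision avoidance against unknown obstacles and smooth transitions between the aerial and aquatic media. Our method is based on a variant of the classic Rapidly-exploring Random Tree, whose main advantage is its ability to handle obstacles, complex nonlinear dynamics, model uncertainties, and external disturbances. The approach uses the dynamic model of the Hydrone, a hybrid vehicle proposed with high underwater performance, but we believe it can easily be generalized to other types of aerial/aquatic platforms. In the experimental section, we present simulation results in obstacle-filled environments where the robot is commanded to perform different media transitions, demonstrating the applicability of our strategy.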
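For readers unfamiliar with the underlying planner, the following is a minimal sketch of the classic Rapidly-exploring Random Tree in a 2D plane. It is not the variant proposed in the paper: it ignores vehicle dynamics, model uncertainty, disturbances, and the air/water transition. The function name `rrt`, the circular obstacle format, and all parameter values are illustrative assumptions.

```python
import math
import random


def rrt(start, goal, obstacles, bounds, step=0.5, goal_tol=0.5,
        max_iters=5000, seed=0):
    """Plain 2D RRT (illustrative only).

    obstacles: list of (cx, cy, radius) circles (assumed format).
    bounds: (low, high) limits applied to both axes.
    Returns a list of points from start to near the goal, or None.
    """
    rng = random.Random(seed)
    nodes = [start]
    parent = [None]  # parent[i] is the index of node i's parent

    def collides(p):
        # Simplification: only the new node is checked, not the whole edge.
        return any(math.dist(p, (cx, cy)) <= r for cx, cy, r in obstacles)

    for _ in range(max_iters):
        # Goal bias: sample the goal itself 10% of the time.
        if rng.random() < 0.1:
            sample = goal
        else:
            sample = (rng.uniform(*bounds), rng.uniform(*bounds))

        # Find the tree node nearest to the sample.
        i_near = min(range(len(nodes)),
                     key=lambda i: math.dist(nodes[i], sample))
        near = nodes[i_near]
        d = math.dist(near, sample)
        if d == 0:
            continue

        # Steer from the nearest node toward the sample by at most `step`.
        scale = min(step, d) / d
        new = (near[0] + (sample[0] - near[0]) * scale,
               near[1] + (sample[1] - near[1]) * scale)
        if collides(new):
            continue

        nodes.append(new)
        parent.append(i_near)

        # Stop once the tree reaches the goal region; walk back to the root.
        if math.dist(new, goal) <= goal_tol:
            path, i = [], len(nodes) - 1
            while i is not None:
                path.append(nodes[i])
                i = parent[i]
            return path[::-1]
    return None


path = rrt((0.0, 0.0), (9.0, 9.0), [(5.0, 5.0, 1.5)], (0.0, 10.0))
```

A planner for a real hybrid vehicle would replace the straight-line steering step with one that respects the vehicle's dynamic model, which is the part the abstract attributes to the proposed variant.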


Generalization Bounds for Inductive Matrix Completion in Low-noise Settings

Antoine Ledent , Rodrigo Alves , Yunwen Lei , Yann Guermeur , Marius Kloft

Categories: Machine Learning | Machine Learning (Statistics)

2022-12-16

We study inductive matrix completion (matrix completion with side information) under an i.i.d. subgaussian noise assumption at a low noise regime, with uniform sampling of the entries. We obtain for the first time generalization bounds with the following three properties: (1) they scale like the standard deviation of the noise and in particular approach zero in the exact recovery case; (2) even in the presence of noise, they converge to zero when the sample size approaches infinity; and (3) for a fixed dimension of the side information, they only have a logarithmic dependence on the size of the matrix. Differently from many works in approximate recovery, we present results both for bounded Lipschitz losses and for the absolute loss, with the latter relying on Talagrand-type inequalities. The proofs create a bridge between two approaches to the theoretical analysis of matrix completion, since they consist in a combination of techniques from both the exact recovery literature and the approximate recovery literature.


Quotations, Coreference Resolution, and Sentiment Annotations in Croatian News Articles: An Exploratory Study

Jelena Sarajlić , Gaurish Thakkar , Diego Alves , Nives Mikelic Preradović

Category: Natural Language Processing

2022-12-14

This paper presents a corpus annotated for the task of direct-speech extraction in Croatian. The paper focuses on the annotation of quotations, co-reference resolution, and sentiment in the Croatian SETimes news corpus, and on the analysis of its language-specific differences compared to English. From this, a list of the phenomena that require special attention when performing these annotations is derived. The generated corpus with quotation feature annotations can be used for multiple tasks in the field of Natural Language Processing.


Building Multilingual Corpora for a Complex Named Entity Recognition and Classification Hierarchy using Wikipedia and DBpedia

Diego Alves , Gaurish Thakkar , Gabriel Amaral , Tin Kuculo , Marko Tadić

Category: Natural Language Processing

2022-12-14

With the ever-growing popularity of the field of NLP, the demand for datasets in low-resourced languages follows suit. Following a previously established framework, in this paper we present the UNER dataset, a multilingual and hierarchical parallel corpus annotated for named entities. We describe in detail the procedure developed to create this type of dataset in any language available on Wikipedia with DBpedia information. The three-step procedure extracts entities from Wikipedia articles, links them to DBpedia, and maps the DBpedia sets of classes to the UNER labels. This is followed by a post-processing procedure that significantly increases the number of identified entities in the final results. The paper concludes with a statistical and qualitative analysis of the resulting dataset.


Building and Evaluating Universal Named-Entity Recognition English corpus

Diego Alves , Gaurish Thakkar , Marko Tadić

Category: Natural Language Processing

2022-12-14

This article presents the application of the Universal Named Entity framework to generate automatically annotated corpora. By using a workflow that extracts Wikipedia data and meta-data and DBpedia information, we generated an English dataset which is described and evaluated. Furthermore, we conducted a set of experiments to improve the annotations in terms of precision, recall, and F1-measure. The final dataset is available and the established workflow can be applied to any language with existing Wikipedia and DBpedia. As part of future research, we intend to continue improving the annotation process and extend it to other languages.


Characterizing instance hardness in classification and regression problems

Gustavo P. Torquette , Victor S. Nunes , Pedro Y. A. Paiva , Lourenço B. C. Neto , Ana C. Lorena

Category: Machine Learning

2022-12-04

Some recent pieces of work in the Machine Learning (ML) literature have demonstrated the usefulness of assessing which observations are hardest to have their label predicted accurately. By identifying such instances, one may inspect whether they have any quality issues that should be addressed. Learning strategies based on the difficulty level of the observations can also be devised. This paper presents a set of meta-features that aim at characterizing which instances of a dataset are hardest to have their label predicted accurately and why they are so, aka instance hardness measures. Both classification and regression problems are considered. Synthetic datasets with different levels of complexity are built and analyzed. A Python package containing all implementations is also provided.
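To make the idea of an instance hardness measure concrete, here is a minimal sketch of k-Disagreeing Neighbors (kDN), a measure commonly used in the instance hardness literature for classification. Whether kDN is among the meta-features in the paper's package is an assumption, not something the abstract states; the function name and the toy data below are hypothetical and do not reflect the paper's Python package or its API.

```python
import math


def k_disagreeing_neighbors(X, y, k=3):
    """For each instance, return the fraction of its k nearest
    neighbours (Euclidean distance) whose label differs from its own.
    Higher values suggest the instance is harder to classify."""
    scores = []
    for i, xi in enumerate(X):
        # Rank all other instances by distance to instance i.
        others = sorted((j for j in range(len(X)) if j != i),
                        key=lambda j: math.dist(xi, X[j]))
        neighbours = others[:k]
        disagree = sum(y[j] != y[i] for j in neighbours)
        scores.append(disagree / len(neighbours))
    return scores


# Toy data: two well-separated clusters plus one point whose label
# contradicts the cluster it sits in.
X = [(0, 0), (0, 1), (1, 0), (1, 1),      # cluster labelled 0
     (5, 5), (5, 6), (6, 5), (6, 6),      # cluster labelled 1
     (0.5, 0.5)]                          # labelled 1, inside cluster 0
y = [0, 0, 0, 0, 1, 1, 1, 1, 1]

scores = k_disagreeing_neighbors(X, y, k=3)
```

In this toy example the last point, whose label disagrees with all of its neighbours, receives the maximum score of 1.0, which is the kind of instance one would inspect for label noise or other quality issues.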


Turning the Tables: Biased, Imbalanced, Dynamic Tabular Datasets for ML Evaluation

Sérgio Jesus , José Pombal , Duarte Alves , André Cruz , Pedro Saleiro , Rita P. Ribeiro , João Gama , Pedro Bizarro

Category: Machine Learning

2022-11-24

Evaluating new techniques on realistic datasets plays a crucial role in the development of ML research and its broader adoption by practitioners. In recent years, there has been a significant increase in publicly available unstructured data resources for computer vision and NLP tasks. However, tabular data -- which is prevalent in many high-stakes domains -- has been lagging behind. To bridge this gap, we present Bank Account Fraud (BAF), the first publicly available privacy-preserving, large-scale, realistic suite of tabular datasets. The suite was generated by applying state-of-the-art tabular data generation techniques on an anonymized, real-world bank account opening fraud detection dataset. This setting carries a set of challenges that are commonplace in real-world applications, including temporal dynamics and significant class imbalance. Additionally, to allow practitioners to stress test both performance and fairness of ML methods, each dataset variant of BAF contains specific types of data bias. With this resource, we aim to provide the research community with a more realistic, complete, and robust test bed to evaluate novel and existing methods.


Chronic pain patient narratives allow for the estimation of current pain intensity

Diogo A. P. Nunes , Joana Ferreira-Gomes , Carlos Vaz , Daniela Oliveira , Sofia Pimenta , Fani Neto , David Martins de Matos

Category: Natural Language Processing

2022-10-31

Chronic pain is a multi-dimensional experience, and pain intensity plays an important part, impacting the patient's emotional balance, psychology, and behaviour. Standard self-reporting tools, such as the Visual Analogue Scale for pain, fail to capture this burden. Moreover, this type of tool is susceptible to a degree of subjectivity, dependent on the patient's clear understanding of how to use it, social biases, and their ability to translate a complex experience to a scale. To overcome these and other self-reporting challenges, pain intensity estimation has been previously studied based on facial expressions, electroencephalograms, brain imaging, and autonomic features. However, to the best of our knowledge, it has never been attempted to base this estimation on the patient narratives of the personal experience of chronic pain, which is what we propose in this work. Indeed, in the clinical assessment and management of chronic pain, verbal communication is essential to convey information to physicians that would otherwise not be easily accessible through standard reporting tools, since language, sociocultural, and psychosocial variables are intertwined. We show that language features from patient narratives indeed convey information relevant for pain intensity estimation, and that our computational models can take advantage of that. Specifically, our results show that patients with mild pain focus more on the use of verbs, whilst moderate and severe pain patients focus on adverbs, and on nouns and adjectives, respectively, and that these differences allow for the distinction between these three pain classes.
