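Several works have studied subjective texts, since they can induce certain behaviours in their readers. Most work focuses on user-generated texts in social networks, but other texts also carry opinions on certain topics and could influence judgement criteria during political decisions. In this work, we address the task of targeted sentiment analysis for the domain of news headlines published by the main outlets during the 2019 Argentinean presidential elections. For this purpose, we present a polarity dataset of 1,976 headlines that mention candidates in the 2019 elections, annotated at the target level. Preliminary experiments with state-of-the-art classification algorithms based on pre-trained language models suggest that target information is helpful for this task. We make our data and pre-trained models publicly available.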
translated by 谷歌翻译
Due to the severity of offensive and hateful comments on social media in Brazil, and the lack of research in Portuguese, this paper provides the first large-scale expert-annotated corpus of Brazilian Instagram comments for hate speech and offensive language detection. The HateBR corpus was collected from the comment sections of Brazilian politicians' accounts on Instagram and manually annotated by specialists, reaching a high inter-annotator agreement. The corpus consists of 7,000 documents annotated according to three different layers: a binary classification (offensive versus non-offensive comments), an offensiveness-level classification (highly, moderately, and slightly offensive), and nine hate speech groups (xenophobia, racism, homophobia, sexism, religious intolerance, partyism, apology for the dictatorship, antisemitism, and fatphobia). We also implemented baseline experiments for offensive language and hate speech detection and compared them with a literature baseline. Results show that the baseline experiments on our corpus outperform the current state of the art for the Portuguese language.
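The three annotation layers described above can be pictured as a small record type. The following is only a minimal sketch: the class and field names are hypothetical, and the rule that non-offensive comments carry no level or hate speech group is an assumption, not something the abstract states.

```python
from dataclasses import dataclass
from enum import Enum
from typing import FrozenSet, Optional


class OffenseLevel(Enum):
    """Second layer: offensiveness level (names are illustrative)."""
    HIGHLY = "highly"
    MODERATELY = "moderately"
    SLIGHTLY = "slightly"


# Third layer: the nine hate speech groups listed in the abstract.
HATE_GROUPS = frozenset({
    "xenophobia", "racism", "homophobia", "sexism", "religious intolerance",
    "partyism", "apology for the dictatorship", "antisemitism", "fatphobia",
})


@dataclass(frozen=True)
class AnnotatedComment:
    """One comment with its three annotation layers (hypothetical schema)."""
    text: str
    offensive: bool                          # first layer: binary label
    level: Optional[OffenseLevel] = None     # second layer
    hate_groups: FrozenSet[str] = frozenset()  # third layer

    def __post_init__(self):
        unknown = set(self.hate_groups) - HATE_GROUPS
        if unknown:
            raise ValueError(f"unknown hate speech groups: {sorted(unknown)}")
        # Assumption: only offensive comments receive a level or a group.
        if not self.offensive and (self.level is not None or self.hate_groups):
            raise ValueError("non-offensive comments carry no further labels")


example = AnnotatedComment(
    text="(comment text)",
    offensive=True,
    level=OffenseLevel.MODERATELY,
    hate_groups=frozenset({"sexism"}),
)
```

Keeping the layers in one record makes it easy to derive a binary, a three-level, or a per-group classification target from the same annotated example.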
We study inductive matrix completion (matrix completion with side information) under an i.i.d. subgaussian noise assumption in a low-noise regime, with uniform sampling of the entries. We obtain for the first time generalization bounds with the following three properties: (1) they scale like the standard deviation of the noise and in particular approach zero in the exact recovery case; (2) even in the presence of noise, they converge to zero when the sample size approaches infinity; and (3) for a fixed dimension of the side information, they only have a logarithmic dependence on the size of the matrix. Unlike many works in approximate recovery, we present results both for bounded Lipschitz losses and for the absolute loss, with the latter relying on Talagrand-type inequalities. The proofs create a bridge between two approaches to the theoretical analysis of matrix completion, since they combine techniques from both the exact recovery literature and the approximate recovery literature.
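For readers unfamiliar with the setting, one common way to write inductive matrix completion is sketched below. All symbols here are assumed for illustration; the abstract does not fix any notation, and the paper's own formulation may differ.

```latex
% Assumed notation, for illustration only.
\begin{align*}
  &\text{Ground truth: } R \in \mathbb{R}^{m \times n}, \qquad
   \text{side information: } X \in \mathbb{R}^{m \times d_1},\;
   Y \in \mathbb{R}^{n \times d_2}, \\
  &\text{Predictor: } \hat{R} = X M Y^{\top}, \qquad
   M \in \mathbb{R}^{d_1 \times d_2}, \\
  &\text{Observations: } \tilde{R}_{i_k j_k} = R_{i_k j_k} + \zeta_k, \qquad
   (i_k, j_k) \sim \mathrm{Unif}\bigl([m] \times [n]\bigr), \quad
   k = 1, \dots, N, \\
  &\text{where the } \zeta_k \text{ are i.i.d. subgaussian with standard deviation } \sigma.
\end{align*}
```

In this notation, property (1) says the bounds scale with $\sigma$, property (2) says they vanish as $N \to \infty$, and property (3) says that for fixed $d_1, d_2$ they depend only logarithmically on $m$ and $n$.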
This paper presents a corpus annotated for the task of direct-speech extraction in Croatian. The paper focuses on quotation annotation, co-reference resolution, and sentiment annotation in the Croatian SETimes news corpus, and on the analysis of its language-specific differences compared to English. From this, a list of phenomena that require special attention when performing these annotations is derived. The resulting corpus with quotation feature annotations can be used for multiple tasks in the field of Natural Language Processing.
With the ever-growing popularity of the field of NLP, the demand for datasets in low-resource languages follows suit. Following a previously established framework, in this paper we present the UNER dataset, a multilingual and hierarchical parallel corpus annotated for named entities. We describe in detail the procedure developed to create this type of dataset in any language available on Wikipedia with DBpedia information. The three-step procedure extracts entities from Wikipedia articles, links them to DBpedia, and maps the DBpedia sets of classes to the UNER labels. This is followed by a post-processing procedure that significantly increases the number of identified entities in the final results. The paper concludes with a statistical and qualitative analysis of the resulting dataset.
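The three-step procedure (extract, link, map) can be illustrated with a toy sketch. Everything below is an assumption made for illustration: the wiki-link regular expression, the in-memory lookup tables standing in for DBpedia, and the label names are hypothetical and are not the authors' implementation or the real UNER label set.

```python
import re

# Hypothetical stand-in for a DBpedia lookup: page title -> class list.
DBPEDIA_TYPES = {
    "Zagreb": ["dbo:City", "dbo:Place"],
    "Nikola_Tesla": ["dbo:Scientist", "dbo:Person"],
}

# Hypothetical mapping from DBpedia classes to illustrative labels.
CLASS_TO_LABEL = {
    "dbo:City": "LOCATION/CITY",
    "dbo:Person": "PERSON",
}

# Matches [[Target]] and [[Target|surface form]] wiki links.
WIKILINK = re.compile(r"\[\[([^\]|]+)(?:\|([^\]]+))?\]\]")


def extract_entities(wikitext):
    """Step 1: collect (surface form, page title) pairs from wiki links."""
    pairs = []
    for match in WIKILINK.finditer(wikitext):
        target = match.group(1).strip()
        surface = (match.group(2) or target).strip()
        pairs.append((surface, target.replace(" ", "_")))
    return pairs


def link_to_dbpedia(title):
    """Step 2: look up the classes for a page title (empty if unknown)."""
    return DBPEDIA_TYPES.get(title, [])


def map_to_label(classes):
    """Step 3: return the label of the first class that has a mapping."""
    for cls in classes:
        if cls in CLASS_TO_LABEL:
            return CLASS_TO_LABEL[cls]
    return None


def annotate(wikitext):
    """Run the three steps and keep only entities that receive a label."""
    annotated = []
    for surface, title in extract_entities(wikitext):
        label = map_to_label(link_to_dbpedia(title))
        if label is not None:
            annotated.append((surface, label))
    return annotated


sample = "[[Nikola Tesla|Tesla]] visited [[Zagreb]] and [[Unknown Page]]."
result = annotate(sample)
```

In this sketch, entities without a known class mapping are simply dropped.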
This article presents the application of the Universal Named Entity framework to generate automatically annotated corpora. By using a workflow that extracts Wikipedia data and meta-data and DBpedia information, we generated an English dataset which is described and evaluated. Furthermore, we conducted a set of experiments to improve the annotations in terms of precision, recall, and F1-measure. The final dataset is available and the established workflow can be applied to any language with existing Wikipedia and DBpedia. As part of future research, we intend to continue improving the annotation process and extend it to other languages.
In this paper, we examine the problem of visibility-aware robot navigation among movable obstacles (VANAMO). A variant of the well-known NAMO robotic planning problem, VANAMO puts additional visibility constraints on robot motion and object movability. This new problem formulation lifts the restrictive assumption that the map is fully visible and the object positions are fully known. We provide a formal definition of the VANAMO problem and propose the Look and Manipulate Backchaining (LaMB) algorithm for solving such problems. LaMB has a simple vision-based API that makes it more easily transferable to real-world robot applications, and it scales to large 3D environments. To evaluate LaMB, we construct a set of tasks that illustrate the complex interplay between visibility and object movability that can arise in mobile base manipulation problems in unknown environments. We show that LaMB outperforms NAMO and visibility-aware motion planning approaches, as well as simple combinations of them, on complex manipulation problems with partial observability.
Evaluating new techniques on realistic datasets plays a crucial role in the development of ML research and its broader adoption by practitioners. In recent years, there has been a significant increase in publicly available unstructured data resources for computer vision and NLP tasks. However, tabular data -- which is prevalent in many high-stakes domains -- has been lagging behind. To bridge this gap, we present Bank Account Fraud (BAF), the first publicly available privacy-preserving, large-scale, realistic suite of tabular datasets. The suite was generated by applying state-of-the-art tabular data generation techniques to an anonymized, real-world bank account opening fraud detection dataset. This setting carries a set of challenges that are commonplace in real-world applications, including temporal dynamics and significant class imbalance. Additionally, to allow practitioners to stress test both performance and fairness of ML methods, each dataset variant of BAF contains specific types of data bias. With this resource, we aim to provide the research community with a more realistic, complete, and robust test bed to evaluate novel and existing methods.
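In recent years, transformer architectures have been growing in popularity. Modulated Detection Transformer (MDETR) is an end-to-end multi-modal understanding model that performs tasks such as phrase grounding, referring expression comprehension, referring expression segmentation, and visual question answering. One remarkable aspect of the model is its capacity to infer over classes that it was not previously trained on. In this work, we explore the use of MDETR in a new task, action detection, without any previous training. We obtain quantitative results using the Atomic Visual Actions dataset. Although the model does not report the best performance on the task, we believe that this is an interesting finding. We show that it is possible to use a multi-modal model to tackle a task that it was not designed for. Finally, we believe that this line of research may lead to the generalization of MDETR to additional downstream tasks.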
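There is a great demand for the robotization of manufacturing processes that involve monotonous labor. Some manufacturing tasks that require specific skills (welding, painting, etc.) suffer from a lack of workers. Robots have been used in these tasks, but their flexibility is limited because they are still difficult for non-experts to program or re-program, which makes them inaccessible to most companies. Robot offline programming (OLP) is reliable. However, paths generated directly from CAD/CAM do not include relevant parameters that represent human skills, such as the orientations and velocities of the robot end-effector. This paper presents an intuitive robot programming system that captures human manufacturing skills and transforms them into robot programs. Demonstrations by skilled human workers are recorded using a magnetic tracking system attached to the worker's tools. The collected data include the orientations and velocities of the working paths. Positional data are extracted from CAD/CAM, since the position error is significant when captured by the magnetic tracker. Path poses are transformed in Cartesian space and validated in a simulation environment. Robot programs are then generated and transferred to the real robot. Experiments on a glass adhesive application process demonstrate the ease of use of the proposed framework and its effectiveness in capturing human skills and transferring them to the robot.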