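Transformers have achieved great success in learning vision and language representations that generalize across various downstream tasks. In visual control, learning a transferable state representation that can transfer between different control tasks is important for reducing the training sample size. However, porting Transformers to sample-efficient visual control remains a challenging and unsolved problem. To this end, we propose a novel Control Transformer (CtrlFormer), which has many appealing benefits that prior art lacks. First, CtrlFormer jointly learns self-attention mechanisms between visual tokens and policy tokens across different control tasks, so that multitask representations can be learned and transferred without catastrophic forgetting. Second, we carefully design a contrastive reinforcement learning paradigm to train CtrlFormer, enabling it to achieve high sample efficiency, which is important in control problems. For example, in the DMControl benchmark, unlike recent advanced methods that fail by producing a zero score on the "Cartpole" task after transfer learning with 100k samples, CtrlFormer achieves a state-of-the-art score with 100k samples while maintaining its performance on previous tasks. The code and models are released on our project homepage.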
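Unsupervised contrastive learning for indoor-scene point clouds has achieved great success. However, unsupervised learning on point clouds in outdoor scenes remains challenging, because previous methods need to reconstruct the whole scene and capture partial views for the contrastive objective. This is infeasible in outdoor scenes with moving objects, obstacles, and sensors. In this paper, we propose CO^3, namely Cooperative Contrastive Learning and Contextual Shape Prediction, to learn 3D representations for outdoor-scene point clouds in an unsupervised manner. CO^3 has several advantages over existing methods. (1) It uses LiDAR point clouds from the vehicle side and the infrastructure side to build views that differ sufficiently while still sharing common semantic information for contrastive learning, which makes them more appropriate than the views built by previous methods. (2) Alongside the contrastive objective, shape context prediction is proposed as a pre-training goal; it brings more task-relevant information to unsupervised 3D point cloud representation learning, which is beneficial when transferring the learned representation to downstream detection tasks. (3) Compared to previous methods, the representation learned by CO^3 can be transferred to different outdoor-scene datasets collected by different types of LiDAR sensors. (4) CO^3 improves current state-of-the-art methods on both the ONCE and KITTI datasets by up to 2.58 mAP. Code and models will be released. We believe CO^3 will facilitate the understanding of LiDAR point clouds in outdoor scenes.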
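Blind face restoration aims to recover a high-quality face image from unknown degradations. Because face images contain abundant contextual information, we propose a method, RestoreFormer, which explores fully spatial attention to model contextual information and surpasses existing works that use local operators. RestoreFormer has several benefits compared to prior art. First, unlike the conventional multi-head self-attention in previous Vision Transformers (ViTs), RestoreFormer incorporates a multi-head cross-attention layer to learn fully spatial interactions between corrupted queries and high-quality key-value pairs. Second, the key-value pairs in RestoreFormer are sampled from a reconstruction-oriented high-quality dictionary, whose elements are rich in high-quality facial features aimed specifically at face reconstruction, leading to superior restoration results. Third, RestoreFormer outperforms advanced state-of-the-art methods on one synthetic dataset and three real-world datasets, and it produces images with better visual quality.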
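The cross-attention idea in the RestoreFormer abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a single head, no learned projections, and plain Python lists, and the names `softmax` and `cross_attention` are hypothetical. It only shows queries from one source (for example, degraded features) attending to keys and values from another source (for example, dictionary entries).

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Single-head scaled dot-product cross-attention (illustrative only).

    queries: list of query vectors (assumed to come from the degraded input)
    keys, values: lists of vectors from a separate source (assumed dictionary)
    Returns one output vector per query.
    """
    scale = 1.0 / math.sqrt(len(keys[0]))
    outputs = []
    for q in queries:
        # Similarity of this query to every key.
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        out = [
            sum(w * v[d] for w, v in zip(weights, values))
            for d in range(len(values[0]))
        ]
        outputs.append(out)
    return outputs


if __name__ == "__main__":
    keys = [[1.0, 0.0], [0.0, 1.0]]
    values = [[1.0, 0.0], [0.0, 1.0]]
    print(cross_attention([[10.0, 0.0], [0.0, 0.0]], keys, values))
```

In self-attention the queries, keys, and values would all be derived from the same token sequence; here the keys and values come from a different source, which is the distinction the abstract draws.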
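This paper presents a simple MLP-like architecture, CycleMLP, which is a versatile backbone for visual recognition and dense prediction. Modern MLP architectures such as MLP-Mixer, ResMLP, and gMLP are tied to the image size and are therefore infeasible for object detection and segmentation. Compared to these approaches, CycleMLP has two advantages. (1) It can cope with various image sizes. (2) It achieves computational complexity that is linear in the image size by using local windows. In contrast, previous MLPs have $O(n^2)$ computation due to their fully spatial connections. We build a family of models that surpass existing MLPs and even state-of-the-art Transformer-based models while using fewer parameters and FLOPs. We extend the applicability of MLP-like models, making them a versatile backbone for dense prediction tasks. CycleMLP achieves competitive results on object detection, instance segmentation, and semantic segmentation. In particular, CycleMLP-Tiny achieves a 3.3% mIoU improvement on the ADE20K dataset with fewer FLOPs. Moreover, CycleMLP also shows excellent zero-shot robustness on the ImageNet-C dataset. Code is available at https://github.com/shoufachen/cyclemlp.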
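The ability to compute similarity scores between graphs based on metrics such as Graph Edit Distance (GED) is important in many real-world applications. Computing exact GED values is typically an NP-hard problem, and traditional algorithms usually achieve an unsatisfactory trade-off between accuracy and efficiency. Recently, Graph Neural Networks (GNNs) have provided a data-driven solution for this task that is more efficient while maintaining prediction accuracy for similarity computation on small graphs (around 10 nodes per graph). Existing GNN-based methods either embed the two graphs separately (lacking low-level cross-graph interactions) or deploy cross-graph interactions over whole graph pairs (redundant and time-consuming), and they struggle as the number of nodes in the graphs increases. In this paper, we focus on similarity computation for large-scale graphs and propose the "embedding-coarsening-matching" framework CoSimGNN, which first embeds and coarsens large graphs with an adaptive pooling operation and then deploys fine-grained interactions on the coarsened graphs to obtain the final similarity scores. Furthermore, we create several synthetic datasets that provide new benchmarks for graph similarity computation. Detailed experiments on both synthetic and real-world datasets have been conducted, and CoSimGNN achieves the best performance while its inference time is at most 1/3 of that of the previous state of the art.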
Deep learning models can achieve high accuracy when trained on large amounts of labeled data. However, real-world scenarios often involve several challenges: Training data may become available in installments, may originate from multiple different domains, and may not contain labels for training. Certain settings, for instance medical applications, often involve further restrictions that prohibit retention of previously seen data due to privacy regulations. In this work, to address such challenges, we study unsupervised segmentation in continual learning scenarios that involve domain shift. To that end, we introduce GarDA (Generative Appearance Replay for continual Domain Adaptation), a generative-replay based approach that can adapt a segmentation model sequentially to new domains with unlabeled data. In contrast to single-step unsupervised domain adaptation (UDA), continual adaptation to a sequence of domains enables leveraging and consolidation of information from multiple domains. Unlike previous approaches in incremental UDA, our method does not require access to previously seen data, making it applicable in many practical scenarios. We evaluate GarDA on two datasets with different organs and modalities, where it substantially outperforms existing techniques.
The development of social media user stance detection and bot detection methods relies heavily on large-scale, high-quality benchmarks. However, in addition to low annotation quality, existing benchmarks generally have incomplete user relationships, which hinders graph-based account detection research. To address these issues, we propose a Multi-Relational Graph-Based Twitter Account Detection Benchmark (MGTAB), the first standardized graph-based benchmark for account detection. To our knowledge, MGTAB is built on the largest original data in the field, with over 1.55 million users and 130 million tweets. MGTAB contains 10,199 expert-annotated users and 7 types of relationships, ensuring high-quality annotation and diversified relations. In MGTAB, we extract the 20 user property features with the greatest information gain, together with user tweet features, as the user features. In addition, we perform a thorough evaluation of MGTAB and other public datasets. Our experiments show that graph-based approaches are generally more effective than feature-based approaches and perform better when multiple relations are introduced. By analyzing the experimental results, we identify effective approaches for account detection and provide potential future research directions in this field. Our benchmark and standardized evaluation procedures are freely available at: https://github.com/GraphDetec/MGTAB.
As one of the prevalent methods for building automation systems, Imitation Learning (IL) shows promising performance across a wide range of domains. However, despite the considerable improvement in policy performance, research on the explainability of IL models is still limited. Inspired by recent approaches in explainable artificial intelligence, we propose a model-agnostic explanation framework for IL models called R2RISE. R2RISE aims to explain the overall policy performance with respect to the frames in the demonstrations. It iteratively retrains the black-box IL model on randomly masked demonstrations and uses the conventional evaluation outcome, the environment returns, as the coefficient to build an importance map. We also conduct experiments to investigate three major questions concerning whether frames are equally important, the effectiveness of the importance map, and the connections between importance maps from different IL models. The results show that R2RISE successfully distinguishes the important frames in the demonstrations.
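The masking-and-scoring loop described in the R2RISE abstract can be sketched roughly as follows. This is an assumption-laden illustration, not the authors' code: `frame_importance` and `train_and_evaluate` are hypothetical names, the callable stands in for "retrain the IL model on the masked demonstrations and return the environment return", and the per-frame normalization (average return over the masks that kept the frame) is a RISE-style choice assumed here.

```python
import random


def frame_importance(num_frames, train_and_evaluate,
                     num_masks=200, keep_prob=0.5, seed=0):
    """Estimate per-frame importance from randomly masked demonstrations.

    train_and_evaluate(mask) is a hypothetical callable: it should retrain
    the model using only frames where mask[i] == 1 and return a scalar score
    (for example, the environment return of the retrained policy).
    """
    rng = random.Random(seed)
    weighted = [0.0] * num_frames  # sum of scores over masks keeping frame i
    counts = [0] * num_frames      # number of masks keeping frame i
    for _ in range(num_masks):
        mask = [1 if rng.random() < keep_prob else 0
                for _ in range(num_frames)]
        score = train_and_evaluate(mask)
        for i, kept in enumerate(mask):
            if kept:
                weighted[i] += score
                counts[i] += 1
    # Average score of the runs in which each frame was present.
    return [w / c if c else 0.0 for w, c in zip(weighted, counts)]


if __name__ == "__main__":
    # Toy stand-in for retraining: the policy only succeeds if frame 2 is kept.
    def toy_score(mask):
        return 1.0 if mask[2] else 0.0

    print(frame_importance(5, toy_score))
```

In the toy run, the frame that the score depends on receives the highest importance, while the other frames receive roughly the average score. A real use would replace `toy_score` with an actual retrain-and-rollout routine, which is far more expensive per mask.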
Compressed videos often exhibit visually annoying artifacts, known as Perceivable Encoding Artifacts (PEAs), which dramatically degrade video visual quality. Subjective and objective measures capable of identifying and quantifying various types of PEAs are critical in improving visual quality. In this paper, we investigate the influence of four spatial PEAs (i.e. blurring, blocking, bleeding, and ringing) and two temporal PEAs (i.e. flickering and floating) on video quality. For spatial artifacts, we propose a visual saliency model with a low computational cost and higher consistency with human visual perception. In terms of temporal artifacts, self-attention based TimeSFormer is improved to detect temporal artifacts. Based on the six types of PEAs, a quality metric called Saliency-Aware Spatio-Temporal Artifacts Measurement (SSTAM) is proposed. Experimental results demonstrate that the proposed method outperforms state-of-the-art metrics. We believe that SSTAM will be beneficial for optimizing video coding techniques.
We propose a distributionally robust return-risk model for Markov decision processes (MDPs) under risk and reward ambiguity. The proposed model optimizes the weighted average of mean and percentile performances, and it covers the distributionally robust MDPs and the distributionally robust chance-constrained MDPs (both under reward ambiguity) as special cases. By considering that the unknown reward distribution lies in a Wasserstein ambiguity set, we derive a tractable reformulation for our model. In particular, we show that the return-risk model can also account for risk from an uncertain transition kernel when one only seeks deterministic policies, and that a distributionally robust MDP under the percentile criterion can be reformulated as its nominal counterpart at an adjusted risk level. A scalable first-order algorithm is designed to solve large-scale problems, and we demonstrate the advantages of our proposed model and algorithm through numerical experiments.
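To make "weighted average of mean and percentile performances" concrete, here is a small empirical sketch. It is only an illustration on sampled returns under a fixed distribution; it does not implement the Wasserstein ambiguity set or the paper's reformulation. The names `empirical_quantile`, `return_risk_score`, `weight`, and `level` are hypothetical, and the lower empirical quantile is one assumed way to measure percentile performance.

```python
import math


def empirical_quantile(samples, level):
    """Lower empirical quantile: the smallest sample value v such that at
    least a `level` fraction of the samples are <= v."""
    if not samples:
        raise ValueError("samples must be non-empty")
    if not 0.0 < level <= 1.0:
        raise ValueError("level must be in (0, 1]")
    ordered = sorted(samples)
    index = max(0, math.ceil(level * len(ordered)) - 1)
    return ordered[index]


def return_risk_score(returns, weight, level):
    """Weighted average of the mean return and a percentile return.

    weight = 1 recovers the plain mean; weight = 0 recovers the pure
    percentile criterion (both on the empirical sample only).
    """
    mean_part = sum(returns) / len(returns)
    percentile_part = empirical_quantile(returns, level)
    return weight * mean_part + (1.0 - weight) * percentile_part


if __name__ == "__main__":
    sampled_returns = [0.0, 2.0, 4.0, 10.0]
    print(return_risk_score(sampled_returns, weight=0.5, level=0.5))
```

Sweeping `weight` between 0 and 1 trades off average performance against the guaranteed-with-high-probability level, which is the trade-off the return-risk model is described as capturing.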