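Deep reinforcement learning (RL) algorithms suffer severe performance degradation when interaction data is scarce, which limits their real-world application. Recently, visual representation learning has been shown to be effective and promising for improving sample efficiency in RL. These methods usually rely on contrastive learning and data augmentation to train a transition model for state prediction, which differs from how the model is used in RL, namely value-based planning. As a result, the learned model may not align well with the environment or produce consistent value predictions, especially when the state transition is not deterministic. To address this issue, we propose a novel method called value-consistent representation learning (VCR), which learns representations that are directly relevant to decision-making. More specifically, VCR trains a model to predict the future state (also referred to as the "imagined state") from the current state and a sequence of actions. Instead of aligning this imagined state with the real state returned by the environment, VCR applies a $Q$-value head to both states and obtains two distributions of action values. A distance between them is then computed and minimized, forcing the imagined state to produce action-value predictions similar to those of the real state. We develop two implementations of this idea, for discrete and continuous action spaces respectively. We conduct experiments on the Atari 100K and DeepMind Control Suite benchmarks to validate their effectiveness in improving sample efficiency. The results demonstrate that our methods achieve new state-of-the-art performance among search-free RL algorithms.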
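The value-consistency objective described above can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the linear `q_head`, the `toy_transition` latent dynamics, and the choice of a softmax over Q-values followed by a KL divergence as the "distance" are hypothetical stand-ins, since the abstract does not specify them.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions with full support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def q_head(weights, latent):
    """Linear Q-value head (assumed): one row of weights per action."""
    return [sum(w * z for w, z in zip(row, latent)) for row in weights]


def toy_transition(latent, action):
    """Hypothetical latent dynamics: each action shifts the latent by a fixed offset."""
    offsets = {0: [0.1, 0.0, 0.0], 1: [0.0, 0.1, 0.0]}
    return [z + d for z, d in zip(latent, offsets[action])]


def imagine(latent, actions, transition=toy_transition):
    """Roll the transition model forward over a sequence of actions."""
    for action in actions:
        latent = transition(latent, action)
    return latent


def value_consistency_loss(weights, latent, actions, real_future_latent):
    """Distance between the action-value distributions of imagined and real states.

    The same Q-value head is applied to both states; the distance (KL over
    softmax-normalized Q-values) is an assumption, not the paper's definition.
    """
    imagined = imagine(latent, actions)
    p_real = softmax(q_head(weights, real_future_latent))
    p_imagined = softmax(q_head(weights, imagined))
    return kl_divergence(p_real, p_imagined)
```

In this sketch the loss is zero when the imagined state yields the same action-value distribution as the real state, even if the two latents were to differ in directions the Q-value head ignores, which is the sense in which the objective targets value consistency rather than state reconstruction.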
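There has been significant progress in developing reinforcement learning (RL) training systems. Past works such as IMPALA, Apex, Seed RL, and Sample Factory aim to improve the overall throughput of the system. In this paper, we address a common bottleneck in RL training systems, namely parallel environment execution, which is often the slowest part of the whole system but receives little attention. With a curated design for parallelizing RL environments, we improve RL environment simulation speed across different hardware setups, ranging from a laptop and a modest workstation to a high-end machine such as the NVIDIA DGX-A100. On a high-end machine, EnvPool achieves 1 million frames per second for environment execution on Atari environments and 3 million frames per second on MuJoCo environments. When running on a laptop, EnvPool is 2.8 times as fast as the Python subprocess baseline. Moreover, strong compatibility with existing RL training libraries, including CleanRL, rl_games, and DeepMind Acme, has been demonstrated in the open-source community. Finally, EnvPool allows researchers to iterate on their ideas at a much faster pace and has great potential to become the de facto RL environment execution engine. Example runs show that training Atari Pong and MuJoCo Ant takes only 5 minutes on a laptop. EnvPool is open-sourced at https://github.com/sail-sg/envpool.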
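Existing imitation learning (IL) methods such as inverse reinforcement learning (IRL) usually have a double-loop training process that alternates between learning a reward function and learning a policy, and they tend to suffer from long training times and high variance. In this work, we identify the benefits of differentiable physics simulators and propose a new IL method, Imitation Learning via Differentiable Physics (ILD), which gets rid of the double-loop design and achieves significant improvements in final performance, convergence speed, and stability. ILD incorporates the differentiable physics simulator as a physics prior into the computational graph used for policy learning. It unrolls the dynamics by sampling actions from a parameterized policy, minimizes the distance between the expert trajectory and the agent trajectory, and back-propagates the gradient into the policy through the temporal physics operators. With the physics prior, ILD policies not only transfer to unseen environment specifications but also yield higher final performance on a variety of tasks. In addition, ILD naturally forms a single-loop structure, which significantly improves stability and training speed. To simplify the complex optimization landscape induced by the temporal physics operations, ILD dynamically selects the learning objective for each state during optimization. In our experiments, we show that ILD outperforms state-of-the-art methods on a variety of continuous control tasks while requiring only one expert demonstration. Moreover, ILD can be applied to challenging deformable object manipulation tasks and generalizes to unseen configurations.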
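The single-loop idea of unrolling the dynamics and propagating the trajectory-matching gradient back into the policy can be sketched on a toy problem. This is a deliberately simplified illustration under stated assumptions, not the ILD implementation: the one-dimensional dynamics `x_{t+1} = x_t + dt * a_t`, the deterministic linear policy `a_t = -theta * x_t`, and the helper names `rollout`, `trajectory_loss_and_grad`, and `fit_policy` are all hypothetical, the derivative is propagated by hand through each step instead of by an autodiff simulator, and the dynamic per-state objective selection is omitted.

```python
def rollout(theta, x0, steps, dt):
    """Unroll the toy dynamics and track d x_t / d theta along the way.

    Dynamics (assumed): x_{t+1} = x_t + dt * a_t, with policy a_t = -theta * x_t.
    """
    xs, grads = [x0], [0.0]
    for _ in range(steps):
        x, g = xs[-1], grads[-1]
        # Chain rule through one simulator step:
        # x' = x * (1 - dt * theta)  =>  dx'/dtheta = g * (1 - dt * theta) - dt * x
        xs.append(x * (1.0 - dt * theta))
        grads.append(g * (1.0 - dt * theta) - dt * x)
    return xs, grads


def trajectory_loss_and_grad(theta, expert_xs, x0, dt):
    """Squared distance between agent and expert trajectories, and its gradient."""
    xs, grads = rollout(theta, x0, len(expert_xs) - 1, dt)
    loss = sum((x - e) ** 2 for x, e in zip(xs, expert_xs))
    grad = sum(2.0 * (x - e) * g for x, e, g in zip(xs, expert_xs, grads))
    return loss, grad


def fit_policy(expert_xs, x0, dt, theta=0.2, lr=0.01, iters=500):
    """Single optimization loop: plain gradient descent on the policy parameter."""
    for _ in range(iters):
        _, grad = trajectory_loss_and_grad(theta, expert_xs, x0, dt)
        theta -= lr * grad
    return theta


# One expert demonstration, generated here by a policy with theta = 1.0.
expert_trajectory, _ = rollout(1.0, x0=1.0, steps=20, dt=0.1)
learned_theta = fit_policy(expert_trajectory, x0=1.0, dt=0.1)
```

There is no reward model and no inner policy-optimization loop in this sketch: the only learning signal is the trajectory distance, differentiated through the dynamics.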
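Semi-supervised and weakly supervised learning have recently attracted considerable attention in the object detection literature because they can reduce the annotation cost required to successfully train deep learning models. State-of-the-art approaches to semi-supervised learning rely on student-teacher models trained with a multi-stage process and heavy data augmentation. Custom networks have been developed for the weakly supervised setting, which makes them difficult to adapt to different detectors. In this paper, a weakly semi-supervised training method is introduced that reduces these training challenges yet achieves state-of-the-art performance by leveraging only a small fraction of fully labeled images together with the information in weakly labeled images. In particular, our generic sampling-based learning strategy produces pseudo ground-truth (GT) bounding box annotations in an online fashion, eliminating the need for multi-stage training and student-teacher network configurations. These pseudo GT boxes are sampled from weakly labeled images based on the categorical scores of object proposals accumulated through a score propagation process. Empirical results on the PASCAL VOC dataset indicate that the proposed approach improves performance by 5.0% when VOC 2007 is used as fully labeled data and VOC 2012 as weakly labeled data. Likewise, with 5-10% fully annotated images, we observe an improvement of more than 10% in mAP, showing that a modest investment in image-level annotation can substantially improve detection performance.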
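The online sampling step can be sketched as follows. This is a minimal illustration under assumptions: the function name `sample_pseudo_gt` and the data layout are hypothetical, one box is drawn per image-level class with probability proportional to its score, and the per-proposal scores are taken as given, since the abstract does not detail the score propagation process that accumulates them.

```python
import random


def sample_pseudo_gt(proposals, scores, image_labels, rng):
    """Sample one pseudo ground-truth box per class present in the image.

    proposals:    list of boxes, each (x1, y1, x2, y2)
    scores:       list of dicts mapping class name -> accumulated score,
                  aligned with `proposals`
    image_labels: set of class names from the image-level (weak) annotation
    rng:          a random.Random instance, for reproducibility
    """
    pseudo_gt = []
    for label in sorted(image_labels):
        weights = [s.get(label, 0.0) for s in scores]
        if sum(weights) <= 0.0:
            # No proposal supports this class; skip it for this image.
            continue
        # Draw a proposal with probability proportional to its class score.
        box = rng.choices(proposals, weights=weights, k=1)[0]
        pseudo_gt.append((label, box))
    return pseudo_gt


boxes = [(0, 0, 10, 10), (5, 5, 20, 20)]
box_scores = [{"cat": 1.0, "dog": 0.0}, {"cat": 0.0, "dog": 0.3}]
pseudo_boxes = sample_pseudo_gt(boxes, box_scores, {"cat", "dog"}, random.Random(0))
```

Because the boxes are drawn at each training step from the current scores, no separate pseudo-labeling stage or teacher network is needed in this sketch; the sampled `(label, box)` pairs would simply be treated as targets for the weakly labeled image.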
Few-Shot Instance Segmentation (FSIS) requires models to detect and segment novel classes given only a few support examples. In this work, we explore a simple yet unified solution for FSIS as well as its incremental variants, and introduce a new framework named Reference Twice (RefT) to fully explore the relationship between support and query features based on a Transformer-like framework. Our key insights are twofold. First, with the aid of support masks, we can generate dynamic class centers more appropriately to re-weight query features. Second, we find that support object queries have already encoded key factors after base training. In this way, the query features can be enhanced twice, at the feature level and at the instance level. In particular, we first design a mask-based dynamic weighting module to enhance support features and then propose to link object queries for better calibration via cross-attention. After these steps, performance on novel classes improves significantly over our strong baseline. Additionally, our new framework can be easily extended to incremental FSIS with minor modification. When benchmarked on the COCO dataset under the FSIS, gFSIS, and iFSIS settings, our method achieves competitive performance compared to existing approaches across different shots; e.g., we boost nAP by +8.2/+9.4 over the current state-of-the-art FSIS method for 10/30-shot. We further demonstrate the superiority of our approach on Few-Shot Object Detection. Code and model will be available.
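The first insight, using support masks to build a class center that re-weights query features, can be sketched with masked average pooling. This is an illustrative assumption rather than the RefT module itself: the abstract does not specify how the centers are computed or applied, so the helper names and the channel-wise multiplication below are hypothetical.

```python
def masked_average_pool(support_features, support_mask):
    """Average the support feature vectors at positions where the mask is 1.

    support_features: list of per-position feature vectors (flattened H*W x C)
    support_mask:     list of 0/1 values aligned with the positions
    """
    count = sum(support_mask)
    if count == 0:
        raise ValueError("support mask selects no positions")
    channels = len(support_features[0])
    return [
        sum(m * feat[c] for feat, m in zip(support_features, support_mask)) / count
        for c in range(channels)
    ]


def reweight_query(query_features, class_center):
    """Scale each query feature channel-wise by the class center (assumed form)."""
    return [[q * w for q, w in zip(feat, class_center)] for feat in query_features]


support = [[1, 2], [3, 4], [5, 6]]   # three spatial positions, two channels
mask = [1, 0, 1]                     # the object covers positions 0 and 2
center = masked_average_pool(support, mask)
enhanced_query = reweight_query([[1, 1], [2, 0]], center)
```

Restricting the average to masked positions keeps background features out of the class center, which is the motivation the abstract gives for using support masks.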
In this chapter, we review and discuss the transformation of AI technology in HCI/UX work and assess how AI technology will change how we do the work. We first discuss how AI can be used to enhance the result of user research and design evaluation. We then discuss how AI technology can be used to enhance HCI/UX design. Finally, we discuss how AI-enabled capabilities can improve UX when users interact with computing systems, applications, and services.
As one of the most important psychological stress reactions, micro-expressions (MEs) are spontaneous and transient facial expressions that can reveal the genuine emotions of human beings. Automatic micro-expression recognition (MER) is therefore becoming increasingly crucial in the field of affective computing, and it provides essential technical support for lie detection, psychological analysis, and other areas. However, the lack of abundant ME data seriously restricts the development of cutting-edge data-driven MER models. Although several spontaneous ME datasets have recently been released to alleviate this problem, the available data remain very limited. To address this ME data hunger, we construct a dynamic spontaneous ME dataset with the largest current ME data scale, called DFME (Dynamic Facial Micro-expressions), which includes 7,526 well-labeled ME videos elicited from 671 participants and annotated by more than 20 annotators over three years. We then apply four classical spatiotemporal feature learning models to DFME in MER experiments to objectively verify the validity of the dataset. In addition, we explore different solutions to the class imbalance and key-frame sequence sampling problems in dynamic MER on DFME, so as to provide a valuable reference for future research. The comprehensive experimental results show that DFME can facilitate research on automatic MER and provides a new benchmark for MER. DFME will be published via https://mea-lab-421.github.io.
Face Anti-spoofing (FAS) is essential to secure face recognition systems against various physical attacks. However, recent research generally focuses on short-distance applications (e.g., phone unlocking) and gives little consideration to long-distance scenes (e.g., surveillance security checks). To promote relevant research and fill this gap in the community, we collect a large-scale Surveillance High-Fidelity Mask (SuHiFiMask) dataset captured under 40 surveillance scenes, which has 101 subjects from different age groups with 232 3D attacks (high-fidelity masks), 200 2D attacks (posters, portraits, and screens), and 2 adversarial attacks. In this setting, low image resolution and noise interference are new challenges for surveillance FAS. Together with the SuHiFiMask dataset, we propose a Contrastive Quality-Invariance Learning (CQIL) network to alleviate the performance degradation caused by image quality from three aspects: (1) an Image Quality Variable module (IQV) is introduced to recover image information associated with discrimination by incorporating a super-resolution network; (2) generated sample pairs are used to simulate quality variance distributions, helping contrastive learning strategies obtain robust feature representations under quality variation; (3) a Separate Quality Network (SQN) is designed to learn discriminative features independent of image quality. Finally, extensive experiments verify the quality of the SuHiFiMask dataset and the superiority of the proposed CQIL.
When using LiDAR semantic segmentation models for safety-critical applications such as autonomous driving, it is essential to understand and improve their robustness with respect to a large range of LiDAR corruptions. In this paper, we aim to comprehensively analyze the robustness of LiDAR semantic segmentation models under various corruptions. To rigorously evaluate the robustness and generalizability of current approaches, we propose a new benchmark called SemanticKITTI-C, which features 16 out-of-domain LiDAR corruptions in three groups, namely adverse weather, measurement noise, and cross-device discrepancy. We then systematically investigate 11 LiDAR semantic segmentation models, spanning different input representations (e.g., point clouds, voxels, and projected images), network architectures, and training schemes. Through this study, we obtain two insights: 1) The input representation plays a crucial role in robustness; specifically, under specific corruptions, different representations behave differently. 2) Although state-of-the-art methods for LiDAR semantic segmentation achieve promising results on clean data, they are less robust when dealing with noisy data. Finally, based on these observations, we design a robust LiDAR segmentation model (RLSeg) that greatly boosts robustness with simple but effective modifications. We hope that our benchmark, comprehensive analysis, and observations can boost future research in robust LiDAR semantic segmentation for safety-critical applications.
Panoptic Part Segmentation (PPS) unifies panoptic segmentation and part segmentation into one task. Previous works use separate approaches to handle thing, stuff, and part predictions without shared computation or task association. We aim to unify these tasks at the architectural level, designing the first end-to-end unified framework, named Panoptic-PartFormer. Moreover, we find that the previous metric, PartPQ, is biased toward PQ. To handle both issues, we make the following contributions. First, we design a meta-architecture that decouples part features from thing/stuff features. We model things, stuff, and parts as object queries and directly learn to optimize all three forms of prediction as a unified mask prediction and classification problem. Second, we propose a new metric, Part-Whole Quality (PWQ), to better measure this task from both pixel-region and part-whole perspectives. It can also decouple the errors of part segmentation and panoptic segmentation. Third, inspired by Mask2Former and building on our meta-architecture, we propose Panoptic-PartFormer++ and design a new part-whole cross-attention scheme, based on masked cross attention, to further boost part segmentation quality. Finally, extensive ablation studies and analysis demonstrate the effectiveness of both Panoptic-PartFormer and Panoptic-PartFormer++. Compared with the previous Panoptic-PartFormer, our Panoptic-PartFormer++ achieves 2% PartPQ and 3% PWQ improvements on the Cityscapes PPS dataset and 5% PartPQ on the Pascal Context PPS dataset. On both datasets, Panoptic-PartFormer++ achieves new state-of-the-art results while reducing GFLOPs by 70% and parameters by 50%. Our models can serve as a strong baseline and aid future research in PPS. Code will be available.