Generalizability to unseen forgery types is crucial for face forgery detectors. Recent works have made significant progress on generalization through synthetic forgery data augmentation. In this work, we explore another path to improving generalization. Our goal is to suppress the features that are easy to learn during training, so as to reduce the risk of overfitting to specific forgery types. Specifically, in our method, a teacher network takes face images as input and generates an attention map over the deep features using a ViT with diverse multi-head attention. The attention map guides a student network to focus on the low-attended features by suppressing the highly attended deep features. A deep feature mixup strategy is also proposed to synthesize forgeries in the feature domain. Experiments demonstrate that, without data augmentation, our method achieves promising performance on unseen forgeries and highly compressed data.
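As a rough illustration of the two operations the abstract describes, here is a minimal sketch; tensor shapes, the drop ratio, and the Beta-distributed mixup coefficient are our assumptions, not the paper's exact design:

```python
import torch

def suppress_high_attention(student_feats, teacher_attn, drop_ratio=0.3):
    """Zero out the most-attended spatial locations so the student must rely
    on low-attended features. Shapes: feats (B, C, H, W), attn (B, H, W)."""
    b, c, h, w = student_feats.shape
    flat = teacher_attn.reshape(b, -1)
    k = max(1, int(drop_ratio * flat.shape[1]))
    thresh = flat.topk(k, dim=1).values[:, -1:]         # per-sample cutoff
    mask = (flat < thresh).float().reshape(b, 1, h, w)  # keep low-attention cells
    return student_feats * mask

def feature_mixup(real_feats, fake_feats, alpha=0.5):
    """Blend real and fake deep features to synthesize forgeries in feature space."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    return lam * real_feats + (1.0 - lam) * fake_feats
```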
In this work, we investigate improving the generalizability of GAN-generated image detectors by performing data augmentation in the fingerprint domain. Specifically, we first separate the fingerprints and contents of GAN-generated images using an autoencoder-based GAN fingerprint extractor, and then randomly perturb the fingerprints. The original fingerprints are substituted with the perturbed ones and added back to the original contents, producing images that are visually invariant but carry distinct fingerprints. The perturbed images successfully imitate images generated by different GANs and thereby improve the generalization of the detectors, as demonstrated by spectral visualizations. To our knowledge, we are the first to conduct data augmentation in the fingerprint domain. Our work explores a novel direction distinct from previous work on spatial- and frequency-domain augmentation. Extensive cross-GAN experiments demonstrate the effectiveness of our method compared to state-of-the-art methods in detecting fake images generated by unknown GANs.
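A rough sketch of the augmentation step, assuming an `autoencoder` callable that reconstructs the image content; the perturbation form (random rescaling plus Gaussian noise) and its parameters are illustrative, not the paper's:

```python
import numpy as np

def perturb_fingerprint(image, autoencoder, noise_std=0.02, scale_range=(0.5, 1.5)):
    """Split a GAN image into content (AE reconstruction) and fingerprint
    (residual), randomly perturb the fingerprint, then recombine.
    `image` is an HxWx3 float array in [0, 1]."""
    content = autoencoder(image)              # low-frequency content
    fingerprint = image - content             # high-frequency GAN fingerprint
    scale = np.random.uniform(*scale_range)   # random amplitude change
    noise = np.random.normal(0.0, noise_std, fingerprint.shape)
    perturbed = scale * fingerprint + noise
    # Visually near-identical image, but with a distinct fingerprint.
    return np.clip(content + perturbed, 0.0, 1.0)
```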
By transferring knowledge from large, diverse, task-agnostic datasets, modern machine learning models can solve specific downstream tasks either zero-shot or with small task-specific datasets to a high level of performance. While this capability has been demonstrated in other fields such as computer vision, natural language processing, and speech recognition, it remains to be shown in robotics, where the generalization capabilities of models are particularly critical given the difficulty of collecting real-world robotic data. We argue that one of the keys to the success of such general robotic models lies in open-ended, task-agnostic training combined with high-capacity architectures that can absorb all of the diverse robotic data. In this paper, we present a model class, dubbed Robotics Transformer, that exhibits promising scalable model properties. We verify our conclusions in a study of different model classes and their ability to generalize as a function of data size, model size, and data diversity, based on a large-scale data collection on real robots performing real-world tasks. The project's website and videos can be found at robotics-transformer.github.io
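The abstract gives no architectural details, but the general pattern it gestures at (image and instruction tokens into a Transformer, discretized action tokens out) can be sketched as follows; every dimension, the pooling choice, and the bin count are placeholder assumptions:

```python
import torch
import torch.nn as nn

class MiniRoboticsTransformer(nn.Module):
    """Toy instance of the pattern: fuse image and language tokens with a
    Transformer encoder, predict one discretized token per action dimension."""
    def __init__(self, n_action_dims=7, n_action_bins=256, d_model=256):
        super().__init__()
        self.img_proj = nn.Linear(512, d_model)   # assume 512-d image patch features
        self.txt_proj = nn.Linear(512, d_model)   # assume 512-d language embeddings
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.action_head = nn.Linear(d_model, n_action_dims * n_action_bins)
        self.n_action_dims, self.n_action_bins = n_action_dims, n_action_bins

    def forward(self, img_tokens, txt_tokens):
        x = torch.cat([self.img_proj(img_tokens), self.txt_proj(txt_tokens)], dim=1)
        h = self.encoder(x).mean(dim=1)           # pool over the token sequence
        logits = self.action_head(h)
        return logits.view(-1, self.n_action_dims, self.n_action_bins)
```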
Object-goal navigation (Object-nav) entails searching for, recognizing, and navigating to a target object. Object-nav has been extensively studied by the Embodied-AI community, but most solutions are restricted to static objects (e.g., television, fridge). We propose a modular framework for Object-nav that can efficiently search indoor environments not just for static objects but also for movable objects (e.g., fruits, glasses, phones) that frequently change position due to human intervention. Our contextual-bandit agent efficiently explores the environment by showing optimism in the face of uncertainty, and it learns a model of the likelihood of spotting different objects from each navigable location. The likelihoods are used as rewards in a weighted minimum-latency solver to deduce a trajectory for the robot. We evaluate our algorithms in two simulated environments and a real-world setting, demonstrating high sample efficiency and reliability.
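The paper uses a contextual bandit; the sketch below shows only the simpler non-contextual UCB core of "optimism in the face of uncertainty" over navigable locations, with an illustrative exploration constant:

```python
import numpy as np

class LocationBandit:
    """Pick the navigable location with the highest optimistic estimate of
    spotting the target: empirical mean plus a UCB exploration bonus."""
    def __init__(self, n_locations, c=1.0):
        self.counts = np.zeros(n_locations)
        self.successes = np.zeros(n_locations)
        self.c = c

    def choose(self):
        t = self.counts.sum() + 1
        mean = self.successes / np.maximum(self.counts, 1)
        bonus = self.c * np.sqrt(np.log(t) / np.maximum(self.counts, 1e-6))
        return int(np.argmax(mean + bonus))   # unvisited locations get a huge bonus

    def update(self, loc, spotted):
        self.counts[loc] += 1
        self.successes[loc] += float(spotted)
```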
Despite decades of research, existing navigation systems still face real-world challenges when deployed in the wild, such as in cluttered home environments or human-occupied public spaces. To address this, we propose a new class of implicit control policies that combine the benefits of imitation learning with the robust handling of system constraints offered by model predictive control (MPC). Our approach, called Performer-MPC, uses a learned cost function parameterized by visual context embeddings provided by a Performer, a low-rank implicit-attention Transformer. We jointly train the cost function and construct the controller that relies on it, effectively solving the corresponding bilevel optimization problem end-to-end. We show that the resulting policy improves standard MPC performance by leveraging a few expert demonstrations in a range of challenging real-world scenarios. Compared with a standard MPC policy, Performer-MPC achieves 40% better goal-reaching in cluttered environments and more than 65% better social metrics when navigating around humans.
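The actual method trains the cost and controller jointly as a bilevel optimization; the sketch below is a much simpler sampling-based stand-in that only conveys where a learned, context-conditioned cost enters an MPC step. All four callables are assumed interfaces, not the paper's API:

```python
import numpy as np

def performer_mpc_step(state, goal, context_embedding, learned_cost,
                       sample_rollouts, n_samples=64):
    """Score candidate trajectories with a hand-coded goal term plus a learned
    cost conditioned on the visual context embedding; execute the best plan's
    first step. `sample_rollouts` returns an (n_samples, T, state_dim) array."""
    candidates = sample_rollouts(state, n_samples)
    best_cost, best_traj = np.inf, None
    for traj in candidates:
        goal_cost = np.linalg.norm(traj[-1] - goal)        # reach the goal
        ctx_cost = learned_cost(traj, context_embedding)   # imitation-shaped term
        if goal_cost + ctx_cost < best_cost:
            best_cost, best_traj = goal_cost + ctx_cost, traj
    return best_traj[0]                                    # first step of best plan
```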
Large language models (LLMs) have unlocked new capabilities for task planning from human instructions. However, prior attempts to apply LLMs to real-world robotic tasks have been limited by the lack of grounding in the surrounding scene. In this paper, we develop NLMap, an open-vocabulary, queryable scene representation that addresses this problem. NLMap is a framework for gathering contextual information into an LLM planner, allowing it to see and query the objects available in the scene before generating a context-conditioned plan. NLMap first builds a natural-language-queryable scene representation using a visual language model (VLM). An LLM-based object proposal module parses the instruction and proposes the objects involved, which are used to query the scene representation for object availability and location. The LLM planner then plans with this information about the scene. NLMap allows robots to operate without a fixed list of objects or executable options, enabling real robot operation that previous methods could not achieve. Project website: https://nlmap-saycan.github.io
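A minimal sketch of the querying step, assuming CLIP-style unit-normalized embeddings for map points and the text query; the similarity threshold is illustrative:

```python
import numpy as np

def query_scene(scene_points, scene_embeds, text_embed, threshold=0.28):
    """Return (location, score) pairs for map points whose VLM embedding
    matches the queried object name. `scene_embeds` is (N, D), `text_embed`
    is (D,); both are assumed unit-normalized, so the dot product is cosine
    similarity."""
    sims = scene_embeds @ text_embed
    hits = np.where(sims > threshold)[0]
    return [(scene_points[i], float(sims[i])) for i in hits]

# An LLM proposal module might turn "warm up my lunch" into queries like
# ["food", "microwave"], each checked for availability and location as above.
```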
Large language models (LLMs) trained on code completion have been shown to be capable of synthesizing simple Python programs from docstrings [1]. We find that these code-writing LLMs can be repurposed to write robot policy code, given natural language commands. Specifically, policy code can express functions or feedback loops that process perception outputs (e.g., from object detectors [2], [3]) and parameterize control primitive APIs. When provided with several example commands (formatted as comments) followed by the corresponding policy code (via few-shot prompting), LLMs can take in new commands and autonomously recompose API calls to generate new policy code. By chaining classic logic structures and referencing third-party libraries (e.g., NumPy, Shapely) to perform arithmetic, LLMs used in this way can write robot policies that (i) exhibit spatial-geometric reasoning, (ii) generalize to new instructions, and (iii) prescribe precise values (e.g., velocities) for ambiguous descriptions ("faster") depending on context (i.e., behavioral commonsense). This paper presents Code as Policies: a robot-centric formalization of language-model-generated programs (LMPs) that can represent reactive policies (e.g., impedance controllers) as well as waypoint-based policies (vision-based pick-and-place, trajectory-based control), demonstrated on multiple real robot platforms. Central to our approach is prompting hierarchical code generation (recursively defining undefined functions), which can write more complex code and also improves the state of the art, solving 39.8% of the problems on the HumanEval [1] benchmark. Code and videos are available at https://code-as-policies.github.io
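A toy illustration of the prompting format the abstract describes: commands appear as comments, policy code beneath them. The helper primitives (`get_obj_pos`, `put_first_on_second`, `detect_objects`, `say`, `llm_complete`) are hypothetical stand-ins, not the paper's released API:

```python
# Fragment of a few-shot prompt; each comment is a language command and the
# code below it is the policy an LLM previously produced for that command.
prompt_examples = '''
# move the red block a bit to the left of the blue bowl.
red_pos = get_obj_pos('red block')
bowl_pos = get_obj_pos('blue bowl')
target = [bowl_pos[0] - 0.05, bowl_pos[1]]   # 5 cm to the left
put_first_on_second('red block', target)

# if there are more than two blocks, say so.
n_blocks = len([o for o in detect_objects() if 'block' in o])
if n_blocks > 2:
    say(f'there are {n_blocks} blocks')
'''

new_command = '# stack all blocks into one pile.'
# The LLM completes the prompt with new policy code, which is then executed:
# policy_code = llm_complete(prompt_examples + new_command)
# exec(policy_code)
```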
We propose a watermarking method for protecting the intellectual property (IP) of generative adversarial networks (GANs). The goal is to watermark the GAN model so that any image produced by the GAN contains an invisible watermark (signature), whose presence in the image can be checked at a later stage for ownership verification. To achieve this, a pre-trained CNN watermark decoding block is inserted at the output of the generator. The generator loss is then modified to include a watermark loss term, ensuring that the prescribed watermark can be extracted from the generated images. The watermark is embedded via fine-tuning, with reduced time complexity. Results show that our method can effectively embed an invisible watermark in the generated images. Moreover, our method is general and works with different GAN architectures, different tasks, and different output image resolutions. We also demonstrate the good robustness of the embedded watermark against several post-processing operations, including JPEG compression, noise addition, blurring, and color transformations.
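A sketch of how the watermark term might be folded into the generator loss, assuming a frozen pre-trained decoder that outputs per-bit logits; the BCE formulation and the weight `lam` are our assumptions, not values from the paper:

```python
import torch
import torch.nn.functional as F

def generator_loss_with_watermark(adv_loss, fake_images, wm_decoder,
                                  target_bits, lam=1.0):
    """Combine the usual adversarial generator loss with a watermark term:
    the frozen decoder must recover the prescribed bit string from every
    generated image. `wm_decoder` maps images to (B, n_bits) logits;
    `target_bits` is a float tensor of shape (n_bits,) with 0/1 entries."""
    logits = wm_decoder(fake_images)
    wm_loss = F.binary_cross_entropy_with_logits(
        logits, target_bits.expand_as(logits))
    return adv_loss + lam * wm_loss   # fine-tune the generator on this sum
```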
Recently, several space-time memory-based methods have verified that storing intermediate frames and their masks as memory helps segment target objects in videos. However, they mainly focus on better matching between the current frame and the memory frames, without explicitly attending to memory quality. As a consequence, frames with poor segmentation masks are easily memorized, which causes a mask error-accumulation problem and degrades segmentation performance. In addition, the linear growth of memory frames with the number of processed frames limits the models' ability to handle long videos. To this end, we propose a Quality-aware Dynamic Memory Network (QDMN) that evaluates the segmentation quality of each frame, allowing the memory bank to selectively store accurately segmented frames and thus prevent error accumulation. We then combine segmentation quality with temporal consistency to dynamically update the memory bank, improving the models' practicality. Without any bells and whistles, our QDMN achieves new state-of-the-art performance on both the DAVIS and YouTube-VOS benchmarks. Moreover, extensive experiments show that the proposed Quality Assessment Module (QAM) can be applied to other memory-based methods as a generic plugin and significantly improves their performance. Our source code is available at https://github.com/workforai/QDMN
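A minimal sketch of the selective-storage idea, with an illustrative quality threshold and eviction rule; the paper's QAM predicts the quality score, which is taken as given here:

```python
class QualityAwareMemory:
    """Store a frame/mask pair only if its predicted segmentation quality is
    high enough; when full, evict the oldest non-reference entry so memory
    does not grow linearly with video length. Threshold and capacity are
    illustrative, not the paper's values."""
    def __init__(self, q_threshold=0.8, capacity=20):
        self.q_threshold = q_threshold
        self.capacity = capacity
        self.bank = []                       # list of (frame, mask, quality)

    def maybe_store(self, frame, mask, quality):
        if quality < self.q_threshold:       # skip unreliable masks entirely
            return False
        if len(self.bank) >= self.capacity:  # keep entry 0 (the reference frame)
            self.bank.pop(1)
        self.bank.append((frame, mask, quality))
        return True
```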
We present a novel method for reliably estimating camera pose from a sequence of images acquired in extreme environments, such as the deep sea or extraterrestrial terrain. Data captured under these challenging conditions are corrupted by textureless surfaces, image degradation, and repetitive, highly ambiguous structures. When naively deployed, state-of-the-art methods can fail in such scenarios, as our empirical analysis confirms. In this paper, we attempt to make camera relocalization work under these extreme conditions. To this end, we propose (i) a hierarchical localization system that leverages temporal information and (ii) a novel environment-aware image enhancement method that improves robustness and accuracy. Our extensive experimental results demonstrate the advantages of our method in two extreme settings: localizing an autonomous underwater vehicle and localizing a planetary rover in a Mars-like desert. Moreover, our method achieves performance comparable to state-of-the-art methods on an indoor benchmark (the 7-Scenes dataset) using only 20% of the training data.
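A schematic of the coarse-to-fine loop with a temporal prior, under assumed interfaces for enhancement, candidate retrieval, and pose estimation; the search radii and inlier test are illustrative:

```python
import numpy as np

def relocalize(frame, prev_pose, enhance, retrieve_candidates, estimate_pose):
    """Enhance the image, then try reference candidates near the previous
    pose before falling back to a global search. All four callables are
    assumed interfaces, not the paper's API."""
    img = enhance(frame)                               # environment-aware enhancement
    for radius in (2.0, np.inf):                       # local search first, then global
        for ref in retrieve_candidates(img, prev_pose, radius):
            pose, n_inliers = estimate_pose(img, ref)  # e.g., PnP + RANSAC
            if n_inliers > 50:                         # illustrative acceptance test
                return pose
    return None                                        # relocalization failed
```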