As more and more artificial intelligence (AI) technologies move from the laboratory to real-world applications, the open-set and robustness challenges posed by real-world data have received increasing attention. Data augmentation is a widely used method for improving model performance, and recent works have also confirmed its positive effect on the robustness of AI models. However, most existing data augmentation methods are heuristic and lack exploration of their internal mechanisms. We apply explainable artificial intelligence (XAI) methods to explore the internal mechanisms of popular data augmentation methods, analyze the relationship between game interactions and some widely used robustness metrics, and propose a new proxy for model robustness in the open-set environment. Based on this analysis of internal mechanisms, we develop a mask-based boosting method for data augmentation that comprehensively improves several robustness measures of AI models and beats state-of-the-art data augmentation approaches. Experiments show that our method can be applied to many popular data augmentation methods. Unlike adversarial training, our boosting method not only significantly improves model robustness but also improves test-set accuracy. Our code is available at \url{https://github.com/Anonymous_for_submission}.
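The abstract does not specify how the masks in the boosting method are constructed; as a rough illustration of mask-based augmentation in general, here is a minimal Cutout-style sketch in NumPy, where the patch size and random placement are our assumptions rather than the paper's procedure:

```python
import numpy as np

def random_mask_augment(image, mask_size=8, rng=None):
    """Zero out a randomly placed square patch of an (H, W, C) image.

    A Cutout-style stand-in for a mask-based augmentation step; the
    paper's actual mask construction is not given in the abstract.
    """
    rng = np.random.default_rng(rng)
    h, w = image.shape[:2]
    top = rng.integers(0, h - mask_size + 1)
    left = rng.integers(0, w - mask_size + 1)
    out = image.copy()
    out[top:top + mask_size, left:left + mask_size] = 0.0
    return out

augmented = random_mask_augment(np.ones((32, 32, 3)), mask_size=8, rng=0)
```

In practice such a transform would be composed with the base augmentation pipeline before each training batch.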
Scene text spotting is of great importance to the computer vision community due to its wide variety of applications. Recent methods attempt to introduce linguistic knowledge for challenging recognition rather than relying on pure visual classification. However, how to effectively model linguistic rules in end-to-end deep networks remains a research challenge. In this paper, we argue that the limited capacity of language models comes from 1) implicit language modeling; 2) unidirectional feature representation; and 3) a language model with noisy input. Correspondingly, we propose an autonomous, bidirectional, and iterative ABINet++ for scene text spotting. First, the autonomous design enforces explicit language modeling by decoupling the recognizer into a vision model and a language model and blocking gradient flow between the two. Second, we propose a novel bidirectional cloze network (BCN) as the language model, based on bidirectional feature representation. Third, we propose an iterative correction scheme for the language model that effectively alleviates the impact of noisy input. Finally, to polish ABINet++ for long text recognition, we aggregate horizontal features by embedding Transformer units inside a U-Net, and design a position-and-content attention module that integrates character order and content to attend to character features precisely. ABINet++ achieves state-of-the-art performance on both scene text recognition and scene text spotting benchmarks, consistently demonstrating the superiority of our method in various environments, especially on low-quality images. Furthermore, extensive experiments in both English and Chinese prove that a text spotter incorporating our language modeling method can significantly improve performance in both accuracy and speed compared with commonly used attention-based recognizers.
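To make the idea of iterative correction concrete, here is a deliberately tiny toy sketch: a "language model" that snaps a noisy vision prediction to the nearest word in a hypothetical lexicon, applied repeatedly. The lexicon, the distance function, and the function names are all illustrative assumptions; the real BCN is a learned bidirectional network, not a lookup:

```python
import numpy as np

LEXICON = ["store", "stone", "score"]  # hypothetical vocabulary

def language_correct(text):
    """Toy 'language model': snap a noisy string to the nearest
    lexicon word by per-character mismatch count."""
    def dist(a, b):
        return sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))
    return min(LEXICON, key=lambda w: dist(text, w))

def iterative_spell(vision_text, n_iter=3):
    """Feed the corrected text back into the language model n_iter
    times, mimicking the iterative-correction execution manner
    (the toy model converges after one step)."""
    text = vision_text
    for _ in range(n_iter):
        text = language_correct(text)
    return text

result = iterative_spell("st0re")  # noisy vision output -> "store"
```

The point of iterating in ABINet++ is that the language model's input at round *k* is the (less noisy) output of round *k − 1*, progressively reducing the noise the language model has to cope with.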
In recent years, machine learning has achieved impressive results across different application areas. However, machine learning algorithms do not necessarily perform well on a new domain whose distribution differs from that of the training set. Domain Adaptation (DA) is used to mitigate this problem. One approach taken by existing DA algorithms is to find domain-invariant features whose distribution in the source domain is the same as their distribution in the target domain. In this paper, we propose to let the classifier that performs the final classification task on the target domain implicitly learn the invariant features. This is achieved by feeding the classifier, during training, generated fake samples that are similar to samples from both the source and target domains. We call these generated samples domain-agnostic samples. To accomplish this, we propose a novel variation of generative adversarial networks (GANs), called MiddleGAN, that generates fake samples similar to samples from both the source and target domains, using two discriminators and one generator. We extend the theory of GANs to show that there exist optimal solutions for the parameters of the two discriminators and the generator in MiddleGAN, and empirically show that the samples generated by MiddleGAN are similar to samples from both the source domain and the target domain. We conducted extensive evaluations on 24 benchmarks, comparing MiddleGAN against various state-of-the-art algorithms and outperforming the state of the art by up to 20.1\% on certain benchmarks.
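The two-discriminator setup can be sketched as a generator loss that asks both discriminators (one judging against source data, one against target data) to score the fakes as real. This is a minimal NumPy sketch of such an objective under standard non-saturating BCE assumptions; the function names and the exact loss form are ours, not confirmed details of MiddleGAN:

```python
import numpy as np

def bce(pred, target):
    """Binary cross-entropy on sigmoid-probability outputs."""
    eps = 1e-7
    pred = np.clip(pred, eps, 1 - eps)
    return -np.mean(target * np.log(pred) + (1 - target) * np.log(1 - pred))

def middlegan_generator_loss(d_source_on_fake, d_target_on_fake):
    """Sketch of a two-discriminator generator objective: push the
    generated samples to look real to BOTH the source-domain and
    target-domain discriminators, i.e. toward the 'middle' of the
    two domains. Inputs are discriminator probabilities on fakes."""
    real_label = np.ones_like(d_source_on_fake)
    return bce(d_source_on_fake, real_label) + bce(d_target_on_fake, real_label)

# Both discriminators maximally uncertain (p = 0.5) -> loss = 2 * ln 2
loss = middlegan_generator_loss(np.array([0.5, 0.5]), np.array([0.5, 0.5]))
```

Each discriminator is trained as usual against its own real domain, so the generator's only equilibrium is to produce samples plausible under both.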
Text-to-image generation aims to generate realistic images that are consistent with a given text description. Previous works mainly adopt multi-stage architectures, stacking generator-discriminator pairs for multiple rounds of adversarial training, in which the text semantics used to guide generation remain static across all stages. This work argues that the text features at each stage should be adaptively re-composed conditioned on the status of the historical stages (i.e., the text and image features of historical stages) to provide diversified and accurate semantic guidance during the coarse-to-fine generation process. We therefore propose a novel Dynamical Semantic Evolution GAN (DSE-GAN) that re-composes the text features at each stage under a novel single adversarial multi-stage architecture. Specifically, we design (1) a Dynamical Semantic Evolution (DSE) module, which first aggregates historical image features to summarize the generative feedback, then dynamically selects the words to be re-composed at each stage and re-composes them by dynamically enhancing or suppressing the semantics of different granularity subspaces; and (2) a Single Adversarial Multi-stage Architecture (SAMA), which extends previous structures by eliminating the requirement of complex multiple adversarial training, thereby allowing more stages of text-image interaction and ultimately facilitating the DSE module. We conduct comprehensive experiments and show that DSE-GAN achieves 7.48\% and 37.8\% relative FID improvements on two widely used benchmarks, i.e., CUB-200 and MSCOCO, respectively.
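The word-selection step of the DSE module can be pictured as scoring each word embedding against the aggregated image feedback and gating the words accordingly. The following is a toy NumPy sketch of that idea only; the scoring, gating, and subspace mechanics of the actual module are more elaborate, and all names here are our assumptions:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def dse_recompose(word_feats, img_feedback):
    """Toy sketch of dynamic word re-composition: score each word
    against aggregated image feedback from earlier stages, then
    enhance or suppress words by the resulting gates.
    Shapes: word_feats (num_words, d), img_feedback (d,)."""
    scores = word_feats @ img_feedback       # relevance of each word
    gates = softmax(scores)                  # dynamic word selection
    return gates[:, None] * word_feats       # re-composed text features

rng = np.random.default_rng(0)
words = rng.normal(size=(5, 8))
feedback = rng.normal(size=8)                # summarized image history
recomposed = dse_recompose(words, feedback)
```

The key property the sketch preserves is that the gates change from stage to stage as the image feedback evolves, so the semantic guidance is no longer static.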
Offline reinforcement learning (RL), which aims to learn an optimal policy from a previously collected static dataset, is an important paradigm of RL. Standard RL methods often perform poorly on this task due to function approximation errors on out-of-distribution actions. Although a variety of regularization methods have been proposed to mitigate this issue, they are often constrained by policy classes with limited expressiveness, which can sometimes lead to highly suboptimal solutions. In this paper, we propose Diffusion-QL, which utilizes a conditional diffusion model as a highly expressive policy class for behavior cloning and policy regularization. In our approach, we learn an action-value function and add a term that maximizes action values to the training loss of the conditional diffusion model, which results in a loss that seeks optimal actions close to the behavior policy. We show that both the expressiveness of the diffusion-model-based policy and the coupling of behavior cloning and policy improvement under the diffusion model contribute to the outstanding performance of Diffusion-QL. We illustrate our method and prior work on a simple 2D bandit example with a multimodal behavior policy, and then demonstrate that our method can achieve state-of-the-art performance on the majority of the D4RL benchmark tasks for offline RL.
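The training objective described above combines two terms: the diffusion model's behavior-cloning (denoising) loss and a Q-value maximization term. A minimal sketch of that combination, with the denoising loss abstracted to a scalar and the weight `alpha` as an assumed hyperparameter name:

```python
import numpy as np

def diffusion_ql_loss(bc_diffusion_loss, q_values, alpha=1.0):
    """Sketch of the combined objective: the behavior-cloning loss of
    the conditional diffusion policy plus a Q-maximization term
    (subtracted, since the total loss is minimized).

    bc_diffusion_loss: scalar denoising loss of the diffusion policy.
    q_values: critic values Q(s, a) at actions sampled from the policy.
    """
    return bc_diffusion_loss - alpha * np.mean(q_values)

# BC loss 0.8, mean Q = 2.0, alpha = 0.5 -> total loss -0.2
loss = diffusion_ql_loss(0.8, np.array([1.0, 3.0]), alpha=0.5)
```

The first term keeps sampled actions close to the (possibly multimodal) behavior policy; the second tilts the policy toward high-value actions, which is the coupling of cloning and policy improvement the abstract refers to.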
This paper is the first to provide a comprehensive system-design overview, together with fusion-method selection criteria, of a real-world cooperative autonomous driving system, known as infrastructure-augmented autonomous driving, or IAAD. We present an in-depth introduction to the IAAD hardware and software on both the roadside and vehicle-side computing and communication platforms. We extensively characterize the IAAD system in the context of real-world deployment scenarios and observe that network conditions fluctuating along the roadway are currently the main technical roadblock for cooperative autonomous driving. To address this challenge, we propose new fusion methods, named "inter-frame fusion" and "planning fusion," to complement the current state-of-the-art "intra-frame fusion." We demonstrate that each fusion method has its own benefits and constraints.
With the rapid development of explainable artificial intelligence (XAI), a series of past works has raised concerns about the out-of-distribution (OOD) problem in perturbation-based post-hoc XAI models and shown that their explanations are socially misaligned. We explore the limitations of post-hoc explanation methods that use approximators to mimic the behavior of black-box models. We then propose eXplanation-based Counterfactual Retraining (XCR), which extracts features rapidly. XCR applies the explanations generated by XAI models as counterfactual inputs to retrain the black-box model, addressing the OOD and social-misalignment problems. Evaluation on popular image datasets shows that XCR can improve model performance while retaining only 12.5% of the most important features, without changing the structure of the black-box model. Furthermore, evaluation on corruption-dataset benchmarks shows that XCR is very helpful for improving model robustness and positively affects calibration under the OOD problem. Even without calibration on the validation set, as some OOD calibration methods require, XCR outperforms existing methods on the corrupted-data metrics. If calibration on the validation set is applied, our method also beats current OOD calibration methods on the OOD calibration metrics.
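The "retain only the most important features" step can be illustrated by thresholding an attribution map and masking the input. This is a minimal sketch under our own assumptions (the keep-ratio threshold rule and function names are illustrative, not the paper's exact procedure); the 12.5% figure comes from the abstract:

```python
import numpy as np

def counterfactual_input(x, saliency, keep_ratio=0.125):
    """Keep only the most salient fraction of features and zero the
    rest, producing a counterfactual input for retraining.
    `saliency` is any XAI attribution map with the same shape as `x`."""
    k = max(1, int(round(keep_ratio * x.size)))
    threshold = np.sort(saliency.ravel())[-k]   # k-th largest attribution
    mask = saliency >= threshold
    return x * mask

x = np.arange(8.0)
sal = np.array([0.1, 0.9, 0.2, 0.8, 0.0, 0.3, 0.7, 0.4])
cf = counterfactual_input(x, sal, keep_ratio=0.25)  # keep top 2 of 8
```

Retraining the black-box model on such masked inputs is what ties the model's behavior to the features the explanation actually highlights.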
This paper presents probabilistic conformal prediction (PCP), a predictive inference algorithm that estimates a target variable by a discontinuous predictive set. Given an input, PCP constructs the predictive set based on random samples from an estimated generative model. It is efficient and compatible with either explicit or implicit conditional generative models. Theoretically, we show that PCP guarantees correct marginal coverage with finite samples. Empirically, we study PCP on a variety of simulated and real datasets. Compared to existing conformal inference methods, PCP provides sharper predictive sets.
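A sample-based conformal set of this kind can be sketched as: calibrate a radius from the distance between each held-out label and its nearest generated sample, then report a union of balls around fresh samples. The simplified quantile below omits the finite-sample correction, and the 1-D setting and function names are our assumptions:

```python
import numpy as np

def pcp_radius(cal_y, cal_samples, alpha=0.1):
    """Calibration sketch: for each calibration point, the distance from
    the true y to its nearest generated sample; the radius is the
    (1 - alpha) empirical quantile of those distances (simplified,
    without the finite-sample correction)."""
    dists = np.abs(cal_samples - cal_y[:, None]).min(axis=1)
    return np.quantile(dists, 1 - alpha)

def pcp_set(samples, radius):
    """Predictive set = union of intervals of the calibrated radius
    around samples drawn from the conditional generative model; the
    union may be discontinuous, matching multimodal predictions."""
    return [(s - radius, s + radius) for s in samples]

rng = np.random.default_rng(0)
cal_y = rng.normal(size=200)
cal_samples = rng.normal(size=(200, 20))   # K = 20 generated samples each
r = pcp_radius(cal_y, cal_samples, alpha=0.1)
bands = pcp_set(np.array([-1.0, 1.0]), r)
```

Because the set follows the generated samples rather than a single interval, it can be sharper than interval-based conformal methods when the predictive distribution is multimodal.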
For stable training of generative adversarial networks (GANs), injecting instance noise into the discriminator input is considered a theoretically sound solution, which, however, has not delivered on its promise in practice. This paper introduces Diffusion-GAN, which employs a Gaussian mixture distribution, defined over all the diffusion steps of a forward diffusion chain, to inject instance noise. A random sample from the mixture, diffused from either observed or generated data, is fed as input to the discriminator. The generator is updated by backpropagating its gradient through the forward diffusion chain, whose length is adaptively adjusted to control the maximum noise-to-data ratio allowed at each training step. Theoretical analysis verifies the soundness of the proposed Diffusion-GAN, which provides model- and domain-agnostic differentiable augmentation. A rich set of experiments on diverse datasets shows that Diffusion-GAN can provide stable and data-efficient GAN training, bringing consistent performance improvements over strong GAN baselines for synthesizing photo-realistic images.
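The noise-injection step uses the standard Gaussian forward diffusion: a random step `t` is drawn up to an adaptively chosen maximum, and the data is diffused to that step before reaching the discriminator. A minimal NumPy sketch (the variance schedule values and the uniform choice of `t` are simplifying assumptions):

```python
import numpy as np

def diffuse(x, t, betas):
    """Sample x_t from the forward chain q(x_t | x_0):
    x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * noise,
    where alpha_bar_t is the cumulative product of (1 - beta)."""
    alpha_bar = np.cumprod(1.0 - betas)[t]
    noise = np.random.default_rng(0).normal(size=x.shape)
    return np.sqrt(alpha_bar) * x + np.sqrt(1.0 - alpha_bar) * noise

def mixture_sample(x, max_t, betas):
    """Draw a random step t (here uniform up to the adaptively chosen
    max_t) and diffuse x to it -- i.e., a sample from the mixture over
    diffusion steps that Diffusion-GAN feeds to the discriminator."""
    t = np.random.default_rng(1).integers(0, max_t)
    return diffuse(x, t, betas)

betas = np.linspace(1e-4, 0.02, 100)   # assumed linear schedule
x = np.ones(4)                          # stands in for real or fake data
x_t = mixture_sample(x, max_t=50, betas=betas)
```

Growing `max_t` over training raises the maximum noise-to-data ratio gradually, which is the adaptive mechanism the abstract describes.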
In this paper, we present Uformer, an effective and efficient Transformer-based architecture for image restoration, in which we build a hierarchical encoder-decoder network using Transformer blocks. Uformer has two core designs. First, we introduce a novel locally-enhanced window (LeWin) Transformer block, which performs window-based self-attention instead of global self-attention. It significantly reduces the computational complexity on high-resolution feature maps while capturing local context. Second, we propose a learnable multi-scale restoration modulator, in the form of a multi-scale spatial bias, to adjust features in multiple layers of the Uformer decoder. Our modulator demonstrates superior capability for restoring details in various image restoration tasks while introducing marginal extra parameters and computational cost. Powered by these two designs, Uformer enjoys a high capability for capturing both local and global dependencies for image restoration. To evaluate our approach, extensive experiments are conducted on several image restoration tasks, including image denoising, motion deblurring, defocus deblurring, and deraining. Without bells and whistles, our Uformer achieves superior or comparable performance compared with the state-of-the-art algorithms. Code and models are available at https://github.com/zhendongwang6/uformer.
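The cost saving of window-based attention comes from partitioning the feature map into non-overlapping windows and attending only within each window. A minimal NumPy sketch of that partition-then-attend pattern (query/key/value projections and the locally-enhanced convolution of the actual LeWin block are omitted for brevity):

```python
import numpy as np

def window_partition(feat, win):
    """Split an (H, W, C) feature map into non-overlapping win x win
    windows, flattened to (num_windows, win*win, C)."""
    h, w, c = feat.shape
    feat = feat.reshape(h // win, win, w // win, win, c)
    return feat.transpose(0, 2, 1, 3, 4).reshape(-1, win * win, c)

def window_self_attention(feat, win):
    """Plain self-attention computed independently inside each window,
    so the quadratic attention cost scales with win**2 tokens per
    window instead of (H*W)**2 tokens globally."""
    windows = window_partition(feat, win)
    scale = windows.shape[-1] ** -0.5
    attn = windows @ windows.transpose(0, 2, 1) * scale
    attn = np.exp(attn - attn.max(axis=-1, keepdims=True))
    attn /= attn.sum(axis=-1, keepdims=True)      # softmax over keys
    return attn @ windows

feat = np.random.default_rng(0).normal(size=(8, 8, 4))
out = window_self_attention(feat, win=4)          # (4 windows, 16, 4)
```

For an H x W map this drops the attention cost from O((HW)^2) to O(HW * win^2), which is what makes Transformer blocks affordable on high-resolution restoration features.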