We report a previously undiscovered problem in multi-agent reinforcement learning (MARL), named Diffusion of Responsibility (DR). DR causes failures when agents negotiate a reliable division of responsibility to complete sophisticated cooperative tasks, and it reflects a flaw in how existing algorithms, both value-based and policy-based MARL methods, handle the multi-agent exploration problem. The DR problem shares similarities with a phenomenon of the same name in social psychology, also known as the bystander effect. In this work, we begin by theoretically analyzing the causes of the DR problem, and we emphasize that it is not related to reward shaping or the credit assignment problem. To address the DR problem, we propose a Policy Resonance approach that changes the multi-agent exploration-exploitation strategy and boosts the performance of MARL algorithms on difficult cooperative tasks. Most existing MARL algorithms can be equipped with this approach to resolve the performance degradation caused by the DR problem. Experiments are performed on multiple benchmark tasks, including FME, a diagnostic multi-agent environment, and ADCA, a competitive multi-agent game. Finally, we implement the Policy Resonance approach on state-of-the-art MARL algorithms to illustrate its effectiveness.
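The abstract does not spell out the Policy Resonance mechanism, so the following is only a toy illustration of why independent per-agent exploration suppresses joint exploitation, together with one hypothetical "resonant" fix in which the team shares a single explore-or-exploit decision; the mechanism and names below are assumptions, not the paper's actual algorithm.

```python
# Hypothetical sketch: with independent epsilon-greedy exploration, the
# probability that all n agents exploit simultaneously decays as
# (1 - eps) ** n, so the cooperative joint action is rarely tried in full.
# A shared explore-or-exploit decision keeps that probability at (1 - eps).
import numpy as np

def independent_exploration(greedy_actions, n_actions, eps, rng):
    # Baseline: each agent rolls its own exploration coin.
    return [int(rng.integers(n_actions)) if rng.random() < eps else a
            for a in greedy_actions]

def resonant_exploration(greedy_actions, n_actions, eps, rng):
    # Sketch: one team-wide coin; agents explore together or exploit together.
    if rng.random() < eps:
        return [int(rng.integers(n_actions)) for _ in greedy_actions]
    return list(greedy_actions)

rng = np.random.default_rng(0)
greedy = [2, 0, 3, 1]  # hypothetical greedy actions of 4 agents
print(resonant_exploration(greedy, n_actions=5, eps=0.1, rng=rng))
```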
Multi-agent reinforcement learning (MARL) can solve complex cooperative tasks. However, the efficiency of existing MARL methods relies heavily on a well-defined reward function. Multi-agent tasks with sparse reward feedback are especially challenging, not only because of the credit assignment problem but also because of the low probability of obtaining positive reward feedback. In this paper, we design a graph network called the Cooperation Graph (CG). The Cooperation Graph is the combination of two simple bipartite graphs, namely the Agent Clustering subgraph (ACG) and the Cluster Designating subgraph (CDG). Based on this novel graph structure, we then propose a Cooperation Graph Multi-Agent Reinforcement Learning (CG-MARL) algorithm that can efficiently handle the sparse-reward problem in multi-agent tasks. In CG-MARL, agents are directly controlled by the Cooperation Graph. A policy neural network is trained to manipulate this Cooperation Graph and guide agents to achieve cooperation implicitly. The hierarchical feature of CG-MARL provides room for customized cluster actions, offering an extensible interface for introducing fundamental cooperation knowledge. In experiments, CG-MARL shows state-of-the-art performance on sparse-reward multi-agent benchmarks, including an anti-invasion interception task and a multi-cargo delivery task.
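Based on the description above, the Cooperation Graph can be pictured as two stacked bipartite assignments. Here is a minimal sketch with assumed shapes; the "target" layer and all dimensions are illustrative, and the paper's CDG may designate something other than targets.

```python
# Minimal sketch of the two bipartite sub-graphs: an Agent-Clustering
# sub-graph (ACG) assigning agents to clusters, and a Cluster-Designating
# sub-graph (CDG) assigning clusters to targets (assumed interpretation).
import numpy as np

n_agents, n_clusters, n_targets = 6, 3, 4
rng = np.random.default_rng(0)

# One-hot bipartite adjacency matrices; in CG-MARL a policy network would
# emit and manipulate these edges rather than sampling them at random.
acg = np.eye(n_clusters)[rng.integers(n_clusters, size=n_agents)]  # (agents x clusters)
cdg = np.eye(n_targets)[rng.integers(n_targets, size=n_clusters)]  # (clusters x targets)

# Composing the two sub-graphs routes every agent to a target: agents in
# the same cluster act on the same target, which induces cooperation.
agent_to_target = acg @ cdg                                        # (agents x targets)
print(agent_to_target.argmax(axis=1))
```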
Characterizing playing style is important for football clubs in scouting, monitoring, and match preparation. Previous studies have treated a player's style as a combination of technical performances, failing to consider spatial information. Therefore, this study aims to characterize the playing styles of each playing position in Chinese Football Super League (CSL) matches, integrating the recently adopted player-vector framework. Data from 960 CSL matches in 2016-2019 were used. Match ratings and ten types of match events, with the corresponding coordinates, were collected for all lineup players who played over 45 minutes. Players were first clustered into eight positions. Using non-negative matrix factorization (NMF), a player vector was constructed for each player in each match based on the player-vector framework. Another NMF process was run on the player vectors to extract different types of playing styles. The resulting player vectors revealed 18 distinct playing styles in the CSL. Six performance indicators of each style were investigated to observe their contributions. In general, the playing styles of forwards and midfielders were in line with the evolving trends of football performance, while the styles of defenders should be reconsidered. Multifunctional playing styles were also found among high-rated CSL players.
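The two-stage NMF pipeline can be sketched with scikit-learn. The data below are synthetic stand-ins for the CSL event features, and the intermediate dimensionality is an assumption; only the two-stage structure and the 18 styles follow the abstract.

```python
# Minimal sketch of the two-stage NMF pipeline on synthetic data.
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
# Rows: one player-match observation; columns: non-negative features built
# from match events and their coordinates (synthetic stand-ins here).
X = rng.random((960, 40))

# Stage 1: compress raw event features into per-match player vectors.
# The intermediate dimensionality (30) is an assumption.
stage1 = NMF(n_components=30, init="nndsvda", max_iter=500, random_state=0)
player_vectors = stage1.fit_transform(X)              # (observations x 30)

# Stage 2: a second NMF over the player vectors extracts style prototypes;
# the abstract reports 18 styles, hence 18 components.
stage2 = NMF(n_components=18, init="nndsvda", max_iter=500, random_state=0)
style_loadings = stage2.fit_transform(player_vectors)
styles = stage2.components_                           # (18 styles x 30 dims)
print(style_loadings.argmax(axis=1)[:10])             # dominant style per observation
```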
Gaze estimation is the fundamental basis for many visual tasks. Yet, the high cost of acquiring gaze datasets with 3D annotations hinders the optimization and application of gaze estimation models. In this work, we propose a novel head-eye redirection parametric model based on the Neural Radiance Field, which allows dense gaze data generation with view consistency and accurate gaze direction. Moreover, our head-eye redirection parametric model decouples the face and eyes for separate neural rendering, so the face attributes, identity, illumination, and eye gaze direction can be controlled separately. Diverse 3D-aware gaze datasets can thus be obtained by manipulating the latent codes belonging to different face attributes in an unsupervised manner. Extensive experiments on several benchmarks demonstrate the effectiveness of our method in domain generalization and domain adaptation for gaze estimation tasks.
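As a toy picture of this decoupling, the sketch below drives separate face and eye "renderers" from disjoint latent codes, so gaze is redirected by editing a single code. The linear stand-ins and all shapes are assumptions; the paper's branches are NeRF-based renderers, not linear maps.

```python
# Toy sketch: disjoint latent codes for identity, illumination, and gaze
# feed two separate rendering branches (face vs. eyes), so editing the
# gaze code changes gaze without touching appearance.
import torch
import torch.nn as nn

z_id, z_illum, z_gaze = torch.randn(1, 32), torch.randn(1, 32), torch.randn(1, 32)

face_renderer = nn.Linear(64, 3 * 16 * 16)  # stand-in for the face branch
eye_renderer = nn.Linear(64, 3 * 8 * 8)     # stand-in for the eye branch

face = face_renderer(torch.cat([z_id, z_illum], dim=-1))
eyes = eye_renderer(torch.cat([z_id, z_gaze], dim=-1))

# Redirecting gaze: edit only z_gaze; identity and illumination are kept,
# which is how dense gaze data with consistent appearance could be generated.
new_gaze = torch.randn(1, 32)
eyes_redirected = eye_renderer(torch.cat([z_id, new_gaze], dim=-1))
print(face.shape, eyes_redirected.shape)
```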
The ability to jointly learn from multiple modalities, such as text, audio, and visual data, is a defining feature of intelligent systems. While there have been promising advances in designing neural networks to harness multimodal data, the enormous success of data augmentation currently remains limited to single-modality tasks like image classification. Indeed, it is particularly difficult to augment each modality while preserving the overall semantic structure of the data; for example, a caption may no longer be a good description of an image after standard augmentations have been applied, such as translation. Moreover, it is challenging to specify reasonable transformations that are not tailored to a particular modality. In this paper, we introduce LeMDA, Learning Multimodal Data Augmentation, an easy-to-use method that automatically learns to jointly augment multimodal data in feature space, with no constraints on the identities of the modalities or the relationship between modalities. We show that LeMDA can (1) profoundly improve the performance of multimodal deep learning architectures, (2) apply to combinations of modalities that have not been previously considered, and (3) achieve state-of-the-art results on a wide range of applications comprising image, text, and tabular data.
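As a rough picture of feature-space augmentation with no modality-specific transforms, here is a minimal PyTorch sketch. The residual MLP augmenter and the concatenation scheme are assumptions for illustration, not LeMDA's actual architecture or training objective.

```python
# Minimal sketch: a small network perturbs fused latent features, so no
# per-modality transform (crop, paraphrase, ...) is needed.
import torch
import torch.nn as nn

class FeatureAugmenter(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, z):
        # Residual perturbation keeps augmented features near the originals,
        # which helps preserve the overall semantic structure.
        return z + self.net(z)

# Per-modality encoders produce features that are concatenated, augmented
# in feature space, then passed to the downstream task head.
text_feat = torch.randn(8, 128)    # stand-in text encoder output
image_feat = torch.randn(8, 128)   # stand-in image encoder output
z = torch.cat([text_feat, image_feat], dim=-1)

augmenter = FeatureAugmenter(z.shape[-1])
z_aug = augmenter(z)               # learned, modality-agnostic augmentation
print(z_aug.shape)                 # torch.Size([8, 256])
```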
Despite recent progress towards scaling up multimodal vision-language models, these models are still known to struggle on compositional generalization benchmarks such as Winoground. We find that a critical component lacking from current vision-language models is relation-level alignment: the ability to match directional semantic relations in text (e.g., "mug in grass") with spatial relationships in the image (e.g., the position of the mug relative to the grass). To tackle this problem, we show that relation alignment can be enforced by encouraging the directed language attention from 'mug' to 'grass' (capturing the semantic relation 'in') to match the directed visual attention from the mug to the grass. Tokens and their corresponding objects are softly identified using the cross-modal attention. We prove that this notion of soft relation alignment is equivalent to enforcing congruence between vision and language attention matrices under a 'change of basis' provided by the cross-modal attention matrix. Intuitively, our approach projects visual attention into the language attention space to calculate its divergence from the actual language attention, and vice versa. We apply our Cross-modal Attention Congruence Regularization (CACR) loss to UNITER and improve on the state-of-the-art approach to Winoground.
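The change-of-basis view admits a compact sketch: project the visual self-attention into the language token space through the cross-modal attention matrix, then penalize its divergence from the language self-attention. The row renormalization and the KL divergence below are illustrative choices under assumed row-stochastic matrices, not necessarily CACR's exact loss.

```python
# Minimal sketch of cross-modal attention congruence via a change of basis.
import torch
import torch.nn.functional as F

n_text, n_vis = 12, 20
A_lang = torch.softmax(torch.randn(n_text, n_text), dim=-1)  # language self-attention
A_vis = torch.softmax(torch.randn(n_vis, n_vis), dim=-1)     # visual self-attention
C = torch.softmax(torch.randn(n_text, n_vis), dim=-1)        # cross-modal attention

# Change of basis: visual attention expressed over language tokens.
A_vis_in_lang = C @ A_vis @ C.T
A_vis_in_lang = A_vis_in_lang / A_vis_in_lang.sum(-1, keepdim=True)  # renormalize rows

# Congruence penalty: divergence between projected and actual attention.
loss = F.kl_div(A_vis_in_lang.log(), A_lang, reduction="batchmean")
print(loss.item())
```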
The number of international benchmarking competitions is steadily increasing in various fields of machine learning (ML) research and practice. So far, however, little is known about the common practice as well as the bottlenecks faced by the community in tackling the research questions posed. To shed light on the status quo of algorithm development in the specific field of biomedical imaging analysis, we designed an international survey that was issued to all participants of challenges conducted in conjunction with the IEEE ISBI 2021 and MICCAI 2021 conferences (80 competitions in total). The survey covered participants' expertise and working environments, their chosen strategies, as well as algorithm characteristics. A median of 72% of challenge participants took part in the survey. According to our results, knowledge exchange was the primary incentive (70%) for participation, while the reception of prize money played only a minor role (16%). While a median of 80 working hours was spent on method development, a large portion of participants stated that they did not have enough time for method development (32%). 25% perceived the infrastructure to be a bottleneck. Overall, 94% of all solutions were deep learning-based. Of these, 84% were based on standard architectures. 43% of the respondents reported that the data samples (e.g., images) were too large to be processed at once. This was most commonly addressed by patch-based training (69%), downsampling (37%), and solving 3D analysis tasks as a series of 2D tasks. K-fold cross-validation on the training set was performed by only 37% of the participants, and only 50% of the participants performed ensembling, based on either multiple identical models (61%) or heterogeneous models (39%). 48% of the respondents applied postprocessing steps.
Multimodal image-text models have shown remarkable performance in the past few years. However, evaluating their robustness against distribution shifts is crucial before adopting them in real-world applications. In this paper, we investigate the robustness of 9 popular open-sourced image-text models under common perturbations on five tasks (image-text retrieval, visual reasoning, visual entailment, image captioning, and text-to-image generation). In particular, we propose several new multimodal robustness benchmarks by applying 17 image perturbation and 16 text perturbation techniques on top of existing datasets. We observe that multimodal models are not robust to image and text perturbations, especially to image perturbations. Among the tested perturbation methods, character-level perturbations constitute the most severe distribution shift for text, and zoom blur is the most severe shift for image data. We also introduce two new robustness metrics (MMI and MOR) for proper evaluations of multimodal models. We hope our extensive study sheds light on new directions for the development of robust multimodal models.
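To make the perturbation interface concrete, here is a minimal sketch of one character-level text perturbation (random adjacent-character swaps). It is a generic example of the kind of corruption layered onto existing datasets, not one of the paper's 16 text perturbation techniques.

```python
# Minimal sketch: a character-level perturbation that simulates typos,
# applied to a caption before it is fed to the image-text model.
import random

def char_swap(text: str, rate: float = 0.1, seed: int = 0) -> str:
    """Swap adjacent characters at a given rate to simulate typos."""
    rng = random.Random(seed)
    chars = list(text)
    for i in range(len(chars) - 1):
        if rng.random() < rate:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

caption = "a brown dog jumps over a wooden fence"
print(char_swap(caption))  # perturbed caption for the robustness benchmark
```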
Real-time monocular 3D reconstruction is a challenging problem that remains unsolved. Although recent end-to-end methods have demonstrated promising results, tiny structures and geometric boundaries are hardly captured, because their supervision neglects spatial details and their oversimplified feature fusion ignores temporal cues. To address these problems, we propose an end-to-end 3D reconstruction network SST, which utilizes Sparse estimated points from a visual SLAM system as additional Spatial guidance and fuses Temporal features via a novel cross-modal attention mechanism, achieving more detailed reconstruction results. We propose a Local Spatial-Temporal Fusion module to exploit more informative spatial-temporal cues from multi-view color information and sparse priors, as well as a Global Spatial-Temporal Fusion module to refine the local TSDF volumes with the world-frame model from coarse to fine. Extensive experiments on ScanNet and 7-Scenes demonstrate that SST outperforms all state-of-the-art competitors while keeping a high inference speed of 59 FPS, enabling real-world applications with real-time requirements.
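A minimal sketch of the kind of cross-modal attention fusion described above: dense multi-view features act as queries and sparse SLAM-point features act as keys/values, so sparse geometric priors guide the dense stream. The single-module design, dimensions, and residual fusion are assumptions, not SST's actual architecture.

```python
# Minimal sketch: dense features attend to sparse SLAM-point features.
import torch
import torch.nn as nn

dim = 64
dense_feat = torch.randn(1, 1024, dim)   # flattened dense view features
sparse_feat = torch.randn(1, 200, dim)   # features of sparse SLAM points

attn = nn.MultiheadAttention(embed_dim=dim, num_heads=4, batch_first=True)
fused, _ = attn(query=dense_feat, key=sparse_feat, value=sparse_feat)

# Residual fusion keeps the dense stream intact while injecting sparse cues.
out = dense_feat + fused
print(out.shape)  # torch.Size([1, 1024, 64])
```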
Image-text retrieval in remote sensing aims to provide flexible information for data analysis and application. In recent years, state-of-the-art methods have been dedicated to "scale decoupling" and "semantic decoupling" strategies to further enhance representation capability. However, these approaches focus on disentangling either scale or semantics, ignoring the merger of the two ideas in a unified model, which severely limits the performance of cross-modal retrieval models. To address these issues, we propose a novel Scale-Semantic Joint Decoupling Network (SSJDN) for remote sensing image-text retrieval. Specifically, we design the Bidirectional Scale Decoupling (BSD) module, which exploits Salience Feature Extraction (SFE) and Salience-Guided Suppression (SGS) units to adaptively extract potential features and suppress cumbersome features at other scales in a bidirectional pattern, yielding distinct scale clues. Besides, we design the Label-supervised Semantic Decoupling (LSD) module, which leverages category semantic labels as prior knowledge to supervise images and texts in probing significant semantic-related information. Finally, we design a Semantic-guided Triple Loss (STL), which adaptively generates a constant to adjust the loss function, improving the probability of matching images and texts with the same semantics and shortening the convergence time of the retrieval model. Our proposed SSJDN outperforms state-of-the-art approaches in numerical experiments conducted on four benchmark remote sensing datasets.
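Since the abstract leaves the adaptive constant unspecified, the following is only a sketch of a triplet-style retrieval loss whose weight grows with an assumed semantic similarity between anchor and negative; the form of the adaptive weight and the similarity gate are illustrative, not SSJDN's actual STL.

```python
# Sketch: triplet loss with an adaptively generated scaling constant,
# so semantically harder negatives contribute more to the loss.
import torch
import torch.nn.functional as F

def semantic_triplet_loss(anchor, positive, negative, sem_sim, margin=0.2):
    """sem_sim in [0, 1]: assumed semantic similarity between anchor and
    negative; more similar (harder) negatives get a larger weight."""
    d_pos = 1 - F.cosine_similarity(anchor, positive)
    d_neg = 1 - F.cosine_similarity(anchor, negative)
    alpha = 1.0 + sem_sim                    # adaptive constant (assumed form)
    return (alpha * F.relu(d_pos - d_neg + margin)).mean()

a, p, n = torch.randn(8, 256), torch.randn(8, 256), torch.randn(8, 256)
sem_sim = torch.rand(8)                      # from category semantic labels
print(semantic_triplet_loss(a, p, n, sem_sim).item())
```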