Advanced visual localization techniques, such as hierarchical localization, combine image retrieval with 6 Degree-of-Freedom (DoF) camera pose estimation. They must therefore extract both global and local features from input images. Previous methods have achieved this through resource-intensive or accuracy-reducing means, such as combinatorial pipelines or multi-task distillation. In this study, we present a novel method called SuperGF, which effectively unifies local and global features for visual localization, leading to a better trade-off between localization accuracy and computational efficiency. Specifically, SuperGF is a transformer-based aggregation model that operates directly on image-matching-specific local features and generates global features for retrieval. We evaluate our method experimentally in terms of both accuracy and efficiency, demonstrating its advantages over other methods. We also provide implementations of SuperGF using various types of local features, including dense and sparse, learning-based and hand-crafted descriptors.
The number of international benchmarking competitions is steadily increasing in various fields of machine learning (ML) research and practice. So far, however, little is known about the common practice as well as the bottlenecks faced by the community in tackling the research questions posed. To shed light on the status quo of algorithm development in the specific field of biomedical imaging analysis, we designed an international survey that was issued to all participants of challenges conducted in conjunction with the IEEE ISBI 2021 and MICCAI 2021 conferences (80 competitions in total). The survey covered participants' expertise and working environments, their chosen strategies, and algorithm characteristics. A median of 72% of challenge participants took part in the survey. According to our results, knowledge exchange was the primary incentive (70%) for participation, while the reception of prize money played only a minor role (16%). While a median of 80 working hours was spent on method development, a large portion of participants (32%) stated that they did not have enough time for method development. 25% perceived the infrastructure to be a bottleneck. Overall, 94% of all solutions were deep learning-based. Of these, 84% were based on standard architectures. 43% of the respondents reported that the data samples (e.g., images) were too large to be processed at once. This was most commonly addressed by patch-based training (69%), downsampling (37%), and solving 3D analysis tasks as a series of 2D tasks. K-fold cross-validation on the training set was performed by only 37% of the participants, and only 50% of the participants performed ensembling, based either on multiple identical models (61%) or on heterogeneous models (39%). 48% of the respondents applied postprocessing steps.
In this paper, we consider the inventory management (IM) problem, where we need to make replenishment decisions for a large number of stock keeping units (SKUs) to balance their supply and demand. In our setting, the constraint on shared resources (such as inventory capacity) couples the otherwise independent control of each SKU. We formulate the problem with this structure as a Shared-Resource Stochastic Game (SRSG) and propose an efficient algorithm called Context-aware Decentralized PPO (CD-PPO). Through extensive experiments, we demonstrate that CD-PPO can accelerate the learning procedure compared with standard MARL algorithms.
Image-text retrieval in remote sensing aims to provide flexible information for data analysis and application. In recent years, state-of-the-art methods have relied on "scale decoupling" and "semantic decoupling" strategies to further enhance representation capability. However, these approaches focus on disentangling either scale or semantics and do not merge the two ideas in a unified model, which severely limits the performance of cross-modal retrieval models. To address this issue, we propose a novel Scale-Semantic Joint Decoupling Network (SSJDN) for remote sensing image-text retrieval. Specifically, we design the Bidirectional Scale Decoupling (BSD) module, which uses Salience Feature Extraction (SFE) and Salience-Guided Suppression (SGS) units to adaptively extract potential features and suppress cumbersome features at other scales in a bidirectional pattern, yielding clues at different scales. In addition, we design the Label-supervised Semantic Decoupling (LSD) module, which leverages category semantic labels as prior knowledge to supervise images and texts in probing significant semantic-related information. Finally, we design a Semantic-guided Triple Loss (STL), which adaptively generates a constant to adjust the loss function, improving the probability of matching images and texts with the same semantics and shortening the convergence time of the retrieval model. Our proposed SSJDN outperforms state-of-the-art approaches in numerical experiments conducted on four benchmark remote sensing datasets.
We consider an offline reinforcement learning (RL) setting where the agent needs to learn from a dataset collected by rolling out multiple behavior policies. There are two challenges in this setting: 1) The optimal trade-off between optimizing the RL signal and the behavior cloning (BC) signal changes across states due to the variation in action coverage induced by different behavior policies. Previous methods fail to handle this because they only control the global trade-off. 2) For a given state, the action distribution generated by different behavior policies may have multiple modes. The BC regularizers in many previous methods are mean-seeking, resulting in policies that select out-of-distribution (OOD) actions in the middle of the modes. In this paper, we address both challenges by using an adaptively weighted reverse Kullback-Leibler (KL) divergence as the BC regularizer, built on the TD3 algorithm. Our method not only trades off the RL and BC signals with per-state weights (i.e., strong BC regularization on states with narrow action coverage, and vice versa) but also avoids selecting OOD actions thanks to the mode-seeking property of reverse KL. Empirically, our algorithm outperforms existing offline RL algorithms on the MuJoCo locomotion tasks, both with the standard D4RL datasets and with mixed datasets that combine the standard datasets.
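
The mean-seeking versus mode-seeking distinction mentioned above can be illustrated with a toy example. The sketch below is not the paper's algorithm: it only fits a fixed-width Gaussian to a bimodal target on a 1-D grid, once by minimizing the forward KL and once by minimizing the reverse KL. The grid, the mode locations at -2 and +2, the width 0.5, and all function names are assumptions chosen for illustration.

```python
import math


def normalize(weights):
    """Scale non-negative weights so that they sum to 1."""
    total = sum(weights)
    return [w / total for w in weights]


def gaussian(xs, mu, sigma=0.5):
    """Discretized, normalized Gaussian over the grid xs."""
    return normalize([math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) for x in xs])


def kl(a, b):
    """KL(a || b) for two discrete distributions on the same grid."""
    return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b) if ai > 0.0)


# Hypothetical 1-D action grid and a bimodal "behavior" distribution
# with modes at -2 and +2 (illustrative values, not from the paper).
xs = [i / 10 for i in range(-60, 61)]
p = normalize([a + b for a, b in zip(gaussian(xs, -2.0), gaussian(xs, 2.0))])

# Grid search over the mean of a single-mode "policy" distribution q.
candidates = [i / 10 for i in range(-30, 31)]
mu_forward = min(candidates, key=lambda mu: kl(p, gaussian(xs, mu)))  # KL(p || q)
mu_reverse = min(candidates, key=lambda mu: kl(gaussian(xs, mu), p))  # KL(q || p)
```

In this toy setup, the forward-KL fit is expected to land between the two modes, where the target has almost no mass, while the reverse-KL fit is expected to settle on one of the modes.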
Video dubbing aims to translate the original speech in a film or television program into speech in a target language, which can be achieved with a cascaded system consisting of speech recognition, machine translation, and speech synthesis. To ensure that the translated speech is well aligned with the corresponding video, its length/duration should be as close as possible to that of the original speech, which requires strict length control. Previous works usually control the number of words or characters generated by the machine translation model to be similar to that of the source sentence, without considering the isochronicity of speech, even though the speech duration of words/characters varies across languages. In this paper, we propose a machine translation system tailored to the task of video dubbing, which directly considers the speech duration of each token in translation in order to match the lengths of the source and target speech. Specifically, we control the speech length of the generated sentence by guiding the prediction of each word with duration information, including the speech duration of the word itself as well as how much duration is left for the remaining words. We design experiments on four language directions (German -> English, Spanish -> English, Chinese <-> English), and the results show that the proposed method achieves better length control of the generated speech than baseline methods. To make up for the lack of real-world datasets, we also construct a real-world test set collected from films to provide comprehensive evaluations of the video dubbing task.
Most existing distillation methods ignore the flexible role of the temperature in the loss function and fix it as a hyper-parameter that can be decided by an inefficient grid search. In general, the temperature controls the discrepancy between two distributions and can faithfully determine the difficulty level of the distillation task. Keeping a constant temperature, i.e., a fixed level of task difficulty, is usually sub-optimal for a growing student during its progressive learning stages. In this paper, we propose a simple curriculum-based technique, termed Curriculum Temperature for Knowledge Distillation (CTKD), which controls the task difficulty level during the student's learning career through a dynamic and learnable temperature. Specifically, following an easy-to-hard curriculum, we gradually increase the distillation loss w.r.t. the temperature, leading to increased distillation difficulty in an adversarial manner. As an easy-to-use plug-in technique, CTKD can be seamlessly integrated into existing knowledge distillation frameworks and brings general improvements at a negligible additional computation cost. Extensive experiments on CIFAR-100, ImageNet-2012, and MS-COCO demonstrate the effectiveness of our method. Our code is available at https://github.com/zhengli97/CTKD.
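
The role of the temperature can be made concrete with the standard temperature-scaled distillation loss. The following is a minimal sketch of that generic loss only: it does not implement CTKD's learnable temperature or its adversarial training, and the function names and example logits are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; a higher temperature gives a softer distribution."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kd_loss(student_logits, teacher_logits, temperature):
    """KL(teacher || student) on softened outputs, scaled by T^2 as is conventional."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(t * math.log(t / s) for t, s in zip(p_teacher, p_student) if t > 0.0)
    return temperature ** 2 * kl


# Hypothetical logits for a three-class problem.
teacher = [5.0, 1.0, 0.0]
student = [2.0, 1.5, 0.5]

# Evaluate the same teacher/student pair under a few fixed temperatures.
losses = {t: kd_loss(student, teacher, t) for t in (1.0, 2.0, 4.0)}
```

A fixed-temperature method would pick one key of `losses` by grid search; the abstract's point is that this choice need not stay constant over training.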
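Structural bias has recently been exploited for aspect sentiment triplet extraction (ASTE) and has led to improved performance. On the other hand, it is recognized that explicitly incorporating structural bias has a negative impact on efficiency, whereas pretrained language models (PLMs) can already capture implicit structures. A natural question therefore arises: is structural bias still necessary in the context of PLMs? To answer this question, we propose to address the efficiency issue by using an adapter to integrate structural bias into the PLM and by using a cheap-to-compute relative position structure in place of the syntactic dependency structure. Benchmark evaluation is conducted on the SemEval datasets. The results show that our proposed structural adapter is beneficial to PLMs and achieves state-of-the-art performance over a range of strong baselines, yet with a light parameter demand and low latency. Meanwhile, we raise the concern that the current default of evaluating on small-scale data yields under-confident conclusions. Consequently, we release a large-scale dataset for ASTE. The results on the new dataset suggest that the structural adapter remains effective and efficient at large scale. Overall, we conclude that structural bias is still necessary even with PLMs.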
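Convolutional neural networks (CNNs) have achieved state-of-the-art performance in medical image segmentation, but they require a large number of manual annotations for training. Semi-supervised learning (SSL) methods are promising for reducing the annotation requirement, but their performance is still limited when the dataset size and the number of annotated images are small. Leveraging existing annotated datasets with similar anatomical structures to assist training has the potential to improve the model's performance. However, this is further challenged by the cross-anatomy domain shift caused by differences in appearance, and even in imaging modality, from the target structure. To solve this problem, we propose Contrastive Semi-supervised learning for Cross Anatomy Domain Adaptation (CS-CADA), which adapts a model to segment similar structures in a target domain and requires only limited annotations in the target domain by leveraging a set of existing annotated images of similar structures in a source domain. We use Domain-Specific Batch Normalization (DSBN) to individually normalize the feature maps of the two anatomical domains, and propose a cross-domain contrastive learning strategy to encourage the extraction of domain-invariant features. These are integrated into a Self-Ensembling Mean-Teacher (SE-MT) framework to exploit unlabeled target-domain images with a prediction consistency constraint. Extensive experiments show that CS-CADA is able to solve the challenging cross-anatomy domain shift problem, achieving accurate segmentation of coronary arteries in X-ray images with the help of retinal vessel images, and of cardiac MR images with the help of fundus images, given only a small number of annotations in the target domain.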
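
The idea behind domain-specific batch normalization, keeping separate normalization statistics for each domain, can be sketched in a few lines. This is a toy illustration on scalar features under stated assumptions, not the authors' implementation: the class name, the domain labels, the momentum value, and the example batches are all hypothetical, and learnable scale and shift parameters are omitted.

```python
import math


class DomainSpecificBatchNorm:
    """Toy batch normalization that tracks separate statistics per domain."""

    def __init__(self, domains, eps=1e-5, momentum=0.1):
        self.eps = eps
        self.momentum = momentum
        # One independent set of running statistics for each domain.
        self.running = {d: {"mean": 0.0, "var": 1.0} for d in domains}

    def __call__(self, batch, domain):
        # Batch statistics are computed only from samples of the given domain.
        mean = sum(batch) / len(batch)
        var = sum((v - mean) ** 2 for v in batch) / len(batch)
        # Update that domain's running statistics; other domains are untouched.
        stats = self.running[domain]
        stats["mean"] = (1 - self.momentum) * stats["mean"] + self.momentum * mean
        stats["var"] = (1 - self.momentum) * stats["var"] + self.momentum * var
        return [(v - mean) / math.sqrt(var + self.eps) for v in batch]


# Hypothetical feature values from two domains with very different scales.
dsbn = DomainSpecificBatchNorm(domains=["source", "target"])
source_out = dsbn([10.0, 12.0, 14.0, 16.0], "source")
target_out = dsbn([0.1, 0.2, 0.3, 0.4], "target")
```

Both outputs end up roughly zero-mean and unit-variance even though the raw features differ by orders of magnitude, while the two domains' running statistics stay separate.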
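Expressive speech synthesis, such as audiobook synthesis, remains challenging in terms of style representation learning and prediction. Deriving style from reference audio or predicting style tags from text requires a large amount of labeled data, which is costly to acquire and difficult to define and annotate accurately. In this paper, we propose a novel framework for learning style representations from abundant plain text in a self-supervised manner. It leverages an emotion lexicon and uses contrastive learning and deep clustering. We further integrate the style representation as a conditioning embedding in a multi-style Transformer TTS. Compared with a multi-style TTS that predicts style tags and is trained on the same dataset with human annotations, our method achieves improved results according to subjective evaluations on both in-domain and out-of-domain audiobook test sets. Moreover, with the implicit context-aware style representation, the emotional transitions in long synthesized audio appear more natural. Audio samples are available on the demo page.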