Complex knowledge base question answering can be achieved by converting questions into sequences of predefined actions. However, there is a significant semantic and structural gap between natural language and action sequences, which makes this conversion difficult. In this paper, we introduce an alignment-enhanced complex question answering framework, called ALCQA, which mitigates this gap through question-to-action alignment and question-to-question alignment. We train a question rewriting model to align the question with each action, and utilize a pretrained language model to implicitly align the question with KG artifacts. Moreover, considering that similar questions correspond to similar action sequences, we retrieve the top-k similar question-answer pairs at the inference stage through question-to-question alignment and propose a novel reward-guided action sequence selection strategy to select from candidate action sequences. We conduct experiments on the CQA and WQSP datasets, and the results show that our approach outperforms state-of-the-art methods and obtains a 9.88% improvement in the F1 metric on the CQA dataset. Our source code is available at https://github.com/TTTTTTTTy/ALCQA.
Recent studies have shown that CLIP achieves remarkable success in zero-shot inference, while its fine-tuning performance is not satisfactory. In this paper, we identify that fine-tuning performance is significantly impacted by hyper-parameter choices. We examine various key hyper-parameters and empirically evaluate their impact when fine-tuning CLIP for classification tasks through a comprehensive study. We find that the fine-tuning performance of CLIP is substantially underestimated. Equipped with hyper-parameter refinement, we demonstrate that CLIP itself is better than, or at least competitive with, large-scale supervised pre-training approaches and recent works that use CLIP as the prediction target in Masked Image Modeling. Specifically, CLIP ViT-Base/16 and CLIP ViT-Large/14 achieve 85.7% and 88.0% fine-tuning Top-1 accuracy on the ImageNet-1K dataset, respectively. These observations challenge the conventional conclusion that CLIP is not suitable for fine-tuning, and motivate us to rethink recently proposed improvements based on CLIP. We will release our code publicly at https://github.com/LightDXY/FT-CLIP.
Copy-Paste is a simple and effective data augmentation strategy for instance segmentation. By randomly pasting object instances onto new background images, it creates new training data for free and significantly boosts segmentation performance, especially for rare object categories. Although diverse, high-quality object instances yield larger performance gains, previous works take object instances either from human-annotated instance segmentation datasets or from renderings of 3D object models, and both approaches are too expensive to scale up to obtain good diversity. In this paper, we revisit Copy-Paste at scale with the power of newly emerged zero-shot recognition models (e.g., CLIP) and text2image models (e.g., StableDiffusion). We demonstrate for the first time that using a text2image model to generate images, or a zero-shot recognition model to filter noisily crawled images, for different object categories is a feasible way to make Copy-Paste truly scalable. To this end, we design a data acquisition and processing framework, dubbed "X-Paste", upon which a systematic study is conducted. On the LVIS dataset, X-Paste provides impressive improvements over the strong baseline CenterNet2 with Swin-L as the backbone. Specifically, it achieves +2.6 box AP and +2.1 mask AP gains on all classes, and even more significant gains of +6.8 box AP and +6.5 mask AP on long-tail classes.
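The basic Copy-Paste operation described above (pasting a masked object instance onto a new background) can be sketched as follows. This is a minimal illustration on nested lists, not the X-Paste pipeline; the names `paste_instance` and `random_paste` are hypothetical, and real implementations typically also rescale instances and update the ground-truth masks of occluded objects.

```python
import random

def paste_instance(background, instance, mask, top, left):
    """Return a copy of `background` with the masked pixels of `instance`
    pasted so that the instance's top-left corner lands at (top, left)."""
    out = [row[:] for row in background]  # leave the input image untouched
    for i, (inst_row, mask_row) in enumerate(zip(instance, mask)):
        for j, (pixel, keep) in enumerate(zip(inst_row, mask_row)):
            y, x = top + i, left + j
            # Only pixels inside the object mask and inside the image are copied.
            if keep and 0 <= y < len(out) and 0 <= x < len(out[0]):
                out[y][x] = pixel
    return out

def random_paste(background, instance, mask, rng=random):
    """Paste at a uniformly random location where the instance fits (assumed policy)."""
    max_top = max(len(background) - len(instance), 0)
    max_left = max(len(background[0]) - len(instance[0]), 0)
    top, left = rng.randint(0, max_top), rng.randint(0, max_left)
    return paste_instance(background, instance, mask, top, left), (top, left)

background = [[0] * 4 for _ in range(4)]
instance = [[9, 9], [9, 9]]
mask = [[1, 0], [1, 1]]  # the top-right pixel is not part of the object
pasted = paste_instance(background, instance, mask, top=1, left=1)
augmented, location = random_paste(background, instance, mask, random.Random(0))
```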
Background and Purpose: Colorectal cancer is a common fatal malignancy, the fourth most common cancer in men, and the third most common cancer in women worldwide. Timely detection of cancer in its early stages is essential for treating the disease. Currently, there is a lack of datasets for histopathological image segmentation of rectal cancer, which often hampers assessment accuracy when computer technology is used to aid diagnosis. Methods: The present study provides a new publicly available Enteroscope Biopsy Histopathological Hematoxylin and Eosin Image Dataset for Image Segmentation Tasks (EBHI-Seg). To demonstrate the validity and extensiveness of EBHI-Seg, experimental results on EBHI-Seg are evaluated using classical machine learning methods and deep learning methods. Results: The experimental results show that deep learning methods achieve better image segmentation performance on EBHI-Seg. The maximum Dice score for the classical machine learning methods is 0.948, while the maximum Dice score for the deep learning methods is 0.965. Conclusion: This publicly available dataset contains 5,170 images of six types of tumor differentiation stages and the corresponding ground truth images. The dataset can support researchers in developing new segmentation algorithms for the medical diagnosis of colorectal cancer, which can be used in clinical settings to help doctors and patients.
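For reference, the Dice metric reported above can be sketched as follows, using the standard definition Dice = 2|A ∩ B| / (|A| + |B|) for a predicted mask A and a ground truth mask B. The function name and the empty-mask convention are assumptions made for illustration, not details taken from the EBHI-Seg evaluation code.

```python
def dice_coefficient(pred, truth):
    """Dice = 2|A ∩ B| / (|A| + |B|) for two binary masks given as nested lists."""
    flat_pred = [p for row in pred for p in row]
    flat_truth = [t for row in truth for t in row]
    if len(flat_pred) != len(flat_truth):
        raise ValueError("masks must have the same number of pixels")
    intersection = sum(1 for p, t in zip(flat_pred, flat_truth) if p and t)
    total = sum(1 for p in flat_pred if p) + sum(1 for t in flat_truth if t)
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total

pred = [[1, 1, 0],
        [0, 1, 0]]
truth = [[1, 0, 0],
         [0, 1, 1]]
score = dice_coefficient(pred, truth)  # 2 overlapping pixels, 3 + 3 foreground pixels
```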
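How would you repair a physical object with large missing parts? You might first recover its global yet coarse shape and then gradually add local details. We are motivated to imitate this physical repair procedure to address the point cloud completion task. We propose a novel stepwise point cloud completion network (SPCNet) for various 3D models. SPCNet has a hierarchical bottom-up network architecture. It performs shape completion in an iterative manner: 1) it first infers the global feature of the coarse result; 2) it then infers the local feature with the help of the global feature; 3) it finally infers the detailed result with the help of the local feature and the coarse result. Beyond the wisdom of simulating physical repair, we also design a new cycle-loss-based training strategy to enhance the generalization and robustness of SPCNet. Extensive experiments clearly show the superiority of our SPCNet over state-of-the-art methods on 3D point clouds with large missing regions.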
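This paper presents a simple yet effective framework, MaskCLIP, which incorporates a newly proposed masked self-distillation into contrastive language-image pretraining. The core idea of masked self-distillation is to distill the representation of a full image into the representation predicted from a masked image. This incorporation enjoys two important benefits. First, masked self-distillation targets local patch representation learning, which is complementary to vision-language contrastive learning that focuses on text-related representations. Second, masked self-distillation is also consistent with vision-language contrastive learning from the perspective of the training objective, since both use the visual encoder for feature alignment, and it is therefore able to learn local semantics with indirect supervision from language. We provide specially designed experiments with a comprehensive analysis to validate these two benefits. Empirically, we show that MaskCLIP, when applied to various challenging downstream tasks, achieves superior results in linear probing, fine-tuning, and zero-shot performance with the guidance of the language encoder.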
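Concomitant administration of drugs can cause drug-drug interactions (DDIs). Some drug combinations are beneficial, but others may cause negative effects that were previously unrecorded. Previous work on DDI prediction usually relies on hand-engineered domain knowledge, which is laborious to obtain. In this work, we propose a novel model, the Molecular Substructure-Aware Network (MSAN), to effectively predict potential DDIs from the molecular structures of drug pairs. We adopt a Transformer-like substructure extraction module to acquire a fixed number of representative vectors associated with the various substructure patterns of a drug molecule. The interaction strength between the substructures of the two drugs is then captured by a similarity-based interaction module. We also perform a substructure dropping augmentation before graph encoding to alleviate overfitting. Experimental results on a real-world dataset show that our proposed model achieves state-of-the-art performance. We also show, through a case study, that the predictions of our model are highly interpretable.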
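6D pose estimation of rigid objects from RGB-D images is crucial for object grasping and manipulation in robotics. Although the RGB channels and the depth (D) channel are often complementary, providing appearance and geometry information respectively, it remains non-trivial to fully benefit from the two cross-modal data sources. We start from a simple yet new observation: when an object rotates, its semantic label is invariant to the pose, while its keypoint offset direction varies with the pose. To this end, we propose SO(3)-Pose, a new representation learning network that explores SO(3)-equivariant and SO(3)-invariant features from the depth channel for pose estimation. The SO(3)-invariant features help to learn more distinctive representations for segmenting objects that look similar in the RGB channels. The SO(3)-equivariant features communicate with the RGB features to deduce the (missing) geometry needed to detect keypoints of objects with reflective surfaces from the depth channel. Unlike most existing pose estimation methods, our SO(3)-Pose not only enables information communication between the RGB and depth channels, but also naturally absorbs SO(3)-equivariant geometric knowledge from depth images, leading to better appearance and geometry representation learning. Comprehensive experiments show that our method achieves state-of-the-art performance on three benchmarks.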
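The invariant/equivariant distinction behind this observation can be illustrated with a small numeric sketch. It is not the SO(3)-Pose network: it only checks, for a rotation about the z-axis, that the offset from a surface point to a keypoint rotates together with the object (equivariance), while the offset's length is unchanged (used here as a stand-in for a pose-invariant quantity, which is an assumption of this example). All names and coordinates are hypothetical.

```python
import math

def rotate_z(v, angle):
    """Rotate a 3D vector about the z-axis by `angle` radians."""
    c, s = math.cos(angle), math.sin(angle)
    x, y, z = v
    return (c * x - s * y, s * x + c * y, z)

def offset(point, keypoint):
    """Vector pointing from a surface point to a keypoint."""
    return tuple(k - p for p, k in zip(point, keypoint))

def norm(v):
    return math.sqrt(sum(a * a for a in v))

point, keypoint = (1.0, 0.0, 0.5), (0.0, 2.0, 1.0)
angle = math.pi / 3

off_before = offset(point, keypoint)
# Rotate the whole object, then recompute the offset.
off_after = offset(rotate_z(point, angle), rotate_z(keypoint, angle))

# Equivariant: the offset of the rotated object equals the rotated offset.
equivariant_ok = all(
    abs(a - b) < 1e-9 for a, b in zip(off_after, rotate_z(off_before, angle))
)
# Invariant: the offset length does not depend on the pose.
invariant_ok = abs(norm(off_after) - norm(off_before)) < 1e-9
```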
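Reliable navigation systems have a wide range of applications in robotics and autonomous driving. Current approaches employ an open-loop process that converts sensor inputs directly into actions. However, these open-loop schemes struggle to handle complex and dynamic real-world scenarios because of their poor generalization. Imitating human navigation, we add a reasoning process that converts actions back into internal latent states, forming a two-stage closed loop of perception, decision-making, and reasoning. First, VAE-enhanced demonstration learning endows the model with an understanding of basic navigation rules. Then, two dual processes in RL-enhanced interaction learning generate reward feedback for each other and jointly enhance obstacle avoidance capability. The reasoning model substantially promotes generalization and robustness, and facilitates deploying the algorithm on real-world robots without elaborate transfer. Experiments show that our method adapts better to novel scenarios than state-of-the-art approaches.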
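We propose bootstrapped masked autoencoders (BootMAE), a new approach for vision BERT pretraining. BootMAE improves the original masked autoencoder (MAE) with two core designs: 1) a momentum encoder that provides online features as extra BERT prediction targets; 2) a target-aware decoder that tries to reduce the pressure on the encoder to memorize target-specific information. The first design is motivated by the observation that using a pretrained MAE to extract features as the BERT prediction target for masked tokens achieves better pretraining performance. We therefore add a momentum encoder in parallel with the original MAE encoder, which bootstraps the pretraining performance by using its own representation as the BERT prediction target. In the second design, we pass target-specific information (e.g., the pixel values of unmasked patches) directly to the decoder to reduce the pressure on the encoder to memorize it. The encoder thus focuses on semantic modeling, which is the goal of BERT pretraining, and does not need to waste its capacity on memorizing information about unmasked tokens that is related to the prediction target. Through extensive experiments, our BootMAE achieves 84.2% on ImageNet-1K, a +0.8% gain under the same number of pre-training epochs. BootMAE also obtains a +1.0 mIoU improvement on semantic segmentation on ADE20K, and +1.3 box AP and +1.4 mask AP improvements on object detection and segmentation on the COCO dataset. Code is released at https://github.com/lightdxy/bootmae.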
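The abstract does not say how the momentum encoder is updated. A common choice for momentum encoders is an exponential moving average (EMA) of the online encoder's weights, and the sketch below assumes that rule. The function name, the flat parameter lists, and the value of `m` are all hypothetical and are not taken from the BootMAE code.

```python
def momentum_update(momentum_params, online_params, m=0.999):
    """EMA update (assumed rule): theta_m <- m * theta_m + (1 - m) * theta_online."""
    if len(momentum_params) != len(online_params):
        raise ValueError("parameter lists must have the same length")
    return [m * t + (1.0 - m) * o for t, o in zip(momentum_params, online_params)]

# Toy parameters: the momentum encoder drifts slowly toward the online encoder.
momentum = [1.0, 0.0]
online = [0.0, 1.0]
for _ in range(3):
    momentum = momentum_update(momentum, online, m=0.9)
```

With a momentum coefficient close to 1, the targets produced by the momentum encoder change slowly, which is the usual motivation for this kind of update.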