Medical Visual Question Answering (Medical-VQA) aims to answer clinical questions about radiology images, supporting doctors in their decision-making. However, current Medical-VQA models learn cross-modal representations with vision and text encoders that reside in two separate spaces, which leads to indirect semantic alignment. In this paper, we propose UnICLAM, a Unified and Interpretable Medical-VQA model through Contrastive Representation Learning with Adversarial Masking. Specifically, to learn an aligned image-text representation, we first establish a unified dual-stream pre-training structure with a gradual soft-parameter sharing strategy. Technically, the proposed strategy learns a constraint that keeps the vision and text encoders close in the same space, and the constraint is gradually loosened in higher layers. Moreover, to capture the semantic representation, we extend the adversarial masking data augmentation strategy to the contrastive representation learning of vision and text in a unified manner, alleviating the lack of semantic meaning in the commonly used random masks. Concretely, while encoder training minimizes the distance between the original feature and the masked feature, the adversarial masking model is trained adversarially to maximize that distance. Furthermore, we explore the unified adversarial masking strategy further, which improves potential ante-hoc interpretability while retaining remarkable performance and efficiency. Experimental results on the public VQA-RAD and SLAKE benchmarks demonstrate that UnICLAM outperforms 11 existing state-of-the-art Medical-VQA models. More importantly, we additionally discuss the performance of UnICLAM in diagnosing heart failure, verifying that UnICLAM exhibits superior few-shot adaptation performance in practical disease diagnosis.
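One way to picture a soft-parameter sharing constraint that loosens with depth is a per-layer distance penalty between the two encoders' weights, with a weight that shrinks in higher layers. The sketch below is only an illustration under assumptions: the abstract says the constraint is learned, whereas here the per-layer weight follows a fixed geometric decay, and the names `soft_sharing_penalty`, `base_weight`, and `decay` are hypothetical.

```python
import numpy as np

def soft_sharing_penalty(vision_layers, text_layers, base_weight=1.0, decay=0.5):
    """Sum of squared weight differences per layer, down-weighted with depth.

    Assumption: a fixed geometric decay stands in for the learned,
    gradually loosened constraint described in the abstract.
    """
    penalty = 0.0
    for depth, (w_vision, w_text) in enumerate(zip(vision_layers, text_layers)):
        layer_weight = base_weight * decay ** depth  # looser in higher layers
        penalty += layer_weight * float(np.sum((w_vision - w_text) ** 2))
    return penalty

# Two toy layers per encoder, each a 2x2 weight matrix.
vision = [np.ones((2, 2)), np.ones((2, 2))]
text = [np.zeros((2, 2)), np.zeros((2, 2))]
penalty = soft_sharing_penalty(vision, text)
```

In a training loop such a term would typically be added to the contrastive loss, so that lower layers are pulled together more strongly than higher ones.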
The number of international benchmarking competitions is steadily increasing in various fields of machine learning (ML) research and practice. So far, however, little is known about the common practice as well as the bottlenecks faced by the community in tackling the research questions posed. To shed light on the status quo of algorithm development in the specific field of biomedical image analysis, we designed an international survey that was issued to all participants of challenges conducted in conjunction with the IEEE ISBI 2021 and MICCAI 2021 conferences (80 competitions in total). The survey covered participants' expertise and working environments, their chosen strategies, and algorithm characteristics. A median of 72% of challenge participants took part in the survey. According to our results, knowledge exchange was the primary incentive (70%) for participation, while receiving prize money played only a minor role (16%). While a median of 80 working hours was spent on method development, a large portion of participants (32%) stated that they did not have enough time for method development. 25% perceived the infrastructure to be a bottleneck. Overall, 94% of all solutions were deep learning-based. Of these, 84% were based on standard architectures. 43% of the respondents reported that the data samples (e.g., images) were too large to be processed at once. This was most commonly addressed by patch-based training (69%), downsampling (37%), and solving 3D analysis tasks as a series of 2D tasks. K-fold cross-validation on the training set was performed by only 37% of the participants, and only 50% of the participants performed ensembling, based either on multiple identical models (61%) or on heterogeneous models (39%). 48% of the respondents applied postprocessing steps.
Existing neural rendering methods for creating human avatars typically either require dense input signals such as video or multi-view images, or leverage a learned prior from large-scale specific 3D human datasets such that reconstruction can be performed with sparse-view inputs. Most of these methods fail to achieve realistic reconstruction when only a single image is available. To enable the data-efficient creation of realistic animatable 3D humans, we propose ELICIT, a novel method for learning human-specific neural radiance fields from a single image. Inspired by the fact that humans can easily reconstruct the body geometry and infer the full-body clothing from a single image, we leverage two priors in ELICIT: a 3D geometry prior and a visual semantic prior. Specifically, ELICIT introduces the 3D body shape geometry prior from a skinned vertex-based template model (i.e., SMPL) and implements the visual clothing semantic prior with CLIP-based pre-trained models. Both priors are used to jointly guide the optimization for creating plausible content in the invisible areas. To further improve visual details, we propose a segmentation-based sampling strategy that locally refines different parts of the avatar. Comprehensive evaluations on multiple popular benchmarks, including ZJU-MoCAP, Human3.6M, and DeepFashion, show that ELICIT outperforms current state-of-the-art avatar creation methods when only a single image is available. Code will be made public for research purposes at https://elicit3d.github.io .
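In recent years, blind image quality assessment (IQA) based on convolutional neural networks (CNNs) has drawn widespread attention. A large number of works first extract deep features from a CNN. These features are then processed by spatial average pooling (SAP) and fully connected layers to predict quality. In this paper, inspired by full-reference IQA and texture features, we extend SAP ($1^{st}$ moment) to spatial moment pooling (SMP) by incorporating higher-order moments (e.g., variance and skewness). In addition, we provide a learning-friendly normalization to circumvent numerical issues when computing the gradients of higher moments. Experimental results show that simply upgrading SAP to SMP significantly enhances CNN-based blind IQA methods and achieves state-of-the-art performance.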
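To make the step from spatial average pooling to spatial moment pooling concrete, the following minimal sketch pools a single feature map into per-channel mean, variance, and skewness. It is an illustration under assumptions, not the authors' implementation: the abstract does not specify its learning-friendly normalization, so the `eps` term added to the variance before standardizing the third moment is an assumed stand-in, and the function name is hypothetical.

```python
import numpy as np

def spatial_moment_pooling(features, eps=1e-6):
    """Pool a (C, H, W) feature map into [mean, variance, skewness] per channel.

    Assumption: `eps` stabilizes the skewness denominator; the paper's own
    normalization is not given in the abstract.
    """
    channels = features.shape[0]
    flat = features.reshape(channels, -1).astype(np.float64)
    mean = flat.mean(axis=1)                      # 1st moment (this is SAP)
    centered = flat - mean[:, None]
    variance = (centered ** 2).mean(axis=1)       # 2nd central moment
    skewness = (centered ** 3).mean(axis=1) / (variance + eps) ** 1.5
    return np.concatenate([mean, variance, skewness])

# One channel, 2x2 spatial grid.
feature_map = np.array([[[0.0, 0.0], [0.0, 4.0]]])
pooled = spatial_moment_pooling(feature_map)
```

The pooled vector is three times as long as the SAP output, so the first fully connected layer after pooling would need a correspondingly wider input.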
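In this paper, we investigate the problem of bit allocation in neural video compression (NVC). First, we reveal that a recent bit allocation method claimed to be optimal is in fact sub-optimal due to its implementation. Specifically, we find that its sub-optimality lies in the incorrect application of semi-amortized variational inference (SAVI) to latents with a non-factorized variational posterior. We then show that the corrected version of SAVI on non-factorized latents requires recursively applying back-propagation through gradient ascent, from which we derive the corrected optimal bit allocation algorithm. Because the corrected bit allocation is computationally infeasible, we design an efficient approximation to make it practical. Empirical results show that our proposed correction significantly improves on the incorrect bit allocation in terms of R-D performance and bitrate error, and outperforms all other bit allocation methods by a large margin. The source code is provided in the supplementary material.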
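This paper considers the problem of lossy neural image compression (NIC). Current state-of-the-art (SOTA) methods adopt a uniform posterior to approximate quantization noise and a single-sample pathwise estimator to approximate the gradient of the evidence lower bound (ELBO). In this paper, we propose to train NIC with the multiple-sample importance weighted autoencoder (IWAE) objective, which is tighter than the ELBO and converges to the log-likelihood as the sample size increases. First, we identify that the uniform posterior of NIC has special properties that affect the variance and bias of the pathwise and score function estimators of the IWAE objective. Moreover, we provide insights into a trick commonly adopted in NIC from the perspective of gradient variance. Based on these analyses, we further propose multiple-sample NIC (MS-NIC), an IWAE objective for NIC. Experimental results show that it improves SOTA NIC methods. Our MS-NIC is plug-and-play and can be easily extended to other neural compression tasks.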
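Non-line-of-sight (NLOS) imaging is an emerging technique for detecting objects behind obstacles or around corners. Recent studies on passive NLOS have mainly focused on steady-state measurement and reconstruction methods, which show limitations in recognizing moving targets. We propose an event-based passive NLOS imaging method that is, to the best of our knowledge, novel. We acquire asynchronous event-based data, which contains detailed dynamic information about the NLOS target and effectively mitigates the speckle degradation caused by motion. In addition, we create the first event-based NLOS imaging dataset, NLOS-ES, and extract event-based features with a time-surface representation. We compare reconstructions from event-based data with those from frame-based data. The event-based method performs well on PSNR and LPIPS, scoring 20% and 10% better than the frame-based method, while its data volume is only 2% of that of the traditional method.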
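A time-surface representation turns an asynchronous event stream into a dense image by recording, for each pixel, how recently it fired. The sketch below shows one common exponential-decay form; the abstract does not say which variant is used, so the decay constant `tau`, the `(x, y, t)` event layout, and the choice to ignore polarity are all assumptions made for illustration.

```python
import numpy as np

def time_surface(events, height, width, t_ref, tau):
    """Build an exponential-decay time surface at reference time `t_ref`.

    `events` is an iterable of (x, y, t) tuples; polarity is ignored here
    as a simplifying assumption. Pixels that never fired map to 0.
    """
    last_timestamp = np.full((height, width), -np.inf)
    for x, y, t in events:
        if t <= t_ref:  # only events up to the reference time contribute
            last_timestamp[y, x] = max(last_timestamp[y, x], t)
    return np.exp(-(t_ref - last_timestamp) / tau)

# Three events on a 2x2 sensor; pixel (x=0, y=0) fires twice.
events = [(0, 0, 0.0), (1, 0, 0.5), (0, 0, 1.0)]
surface = time_surface(events, height=2, width=2, t_ref=1.0, tau=0.5)
```

Values close to 1 mark pixels with very recent activity, so the resulting array can be fed to a frame-based network in the same way as an ordinary image.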
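Neural radiance fields (NeRFs) have been successfully used for scene representation. Recent works have also developed robotic navigation and manipulation systems using NeRF-based environment representations. Because object localization is the foundation of many robotic applications, we study object localization within NeRF scenes to further unleash the potential of NeRFs in robotic systems. We propose NeRF-Loc, a transformer-based framework for extracting 3D bounding boxes of objects in NeRF scenes. NeRF-Loc takes a pre-trained NeRF model and a camera view as input and produces labeled 3D bounding boxes of objects as output. Specifically, we design a pair of parallel transformer encoder branches, the coarse stream and the fine stream, to encode both the context and the details of target objects. The encoded features are then fused with attention layers to reduce ambiguity and enable accurate object localization. We have compared our method with conventional transformer-based methods, and our method achieves better performance. In addition, we present NeRFLocBench, the first benchmark for object localization based on NeRF samples.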
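In this paper, we consider the problem of bit allocation in neural video compression (NVC). Because of the frame reference structure, current NVC methods that use the same R-D (rate-distortion) trade-off parameter $\lambda$ for all frames are suboptimal, which creates the need for bit allocation. Unlike previous methods based on heuristic and empirical R-D models, we propose to solve this problem with gradient-based optimization. Specifically, we first propose a continuous bit implementation method based on semi-amortized variational inference (SAVI). We then propose a pixel-level implicit bit allocation method that uses iterative optimization by changing the SAVI objective. Moreover, we derive a precise R-D model based on the differentiable nature of NVC. We show the optimality of our method by proving its equivalence to bit allocation with the precise R-D model. Experimental results show that our method significantly improves NVC methods and outperforms existing bit allocation methods. Our method is plug-and-play for all differentiable NVC methods and can be applied directly to existing pre-trained models.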
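Neural image compression (NIC) has outperformed traditional image codecs in rate-distortion (R-D) performance. However, it usually requires a dedicated encoder-decoder pair for each point on the R-D curve, which greatly hinders its practical deployment. Although some recent works achieve bitrate control through conditional coding, they impose a strong prior during training and offer limited flexibility. In this paper, we propose Code Editing, a highly flexible coding method for NIC based on semi-amortized inference and adaptive quantization. Our work is a new paradigm for variable-bitrate NIC. Furthermore, experimental results show that our method surpasses existing variable-rate methods and achieves ROI coding and multi-distortion trade-off with a single decoder.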