With advances in computer vision and NLP, vision-language (VL) is becoming an important area of research. Despite its importance, evaluation metrics for this research field are still at a preliminary stage of development. In this paper, we propose a quantitative metric, the "symbol score," and an evaluation dataset, the "human puzzle," to assess whether VL models understand images as humans do. We observe that VL models do not interpret the overall context of an input image, but instead show a bias toward specific objects or shapes that form local context. We aim to quantitatively measure how well a model performs at understanding context. To probe the capability of current VL models, we cut the original input image into pieces and place them randomly, distorting the global context of the image. Our paper discusses each VL model's level of interpretation of global context and addresses how structural features affect the results.
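The probing procedure described above can be sketched as follows. This is a minimal illustration only; the grid size, random seed, and function name are assumptions, not details from the paper:

```python
import numpy as np

def shuffle_patches(image, grid=3, seed=0):
    """Cut an image into grid x grid patches and place them randomly,
    destroying global context while keeping local content intact."""
    h, w = image.shape[:2]
    ph, pw = h // grid, w // grid
    patches = [image[r * ph:(r + 1) * ph, c * pw:(c + 1) * pw]
               for r in range(grid) for c in range(grid)]
    rng = np.random.default_rng(seed)
    rng.shuffle(patches)                      # random placement of patches
    rows = [np.concatenate(patches[r * grid:(r + 1) * grid], axis=1)
            for r in range(grid)]
    return np.concatenate(rows, axis=0)

img = np.arange(36).reshape(6, 6)
shuffled = shuffle_patches(img, grid=3)       # same pixels, new global layout
```

A model that relies on global context should degrade sharply on such shuffled inputs, while a model biased toward local objects or shapes may not.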
The number of international benchmarking competitions is steadily increasing in various fields of machine learning (ML) research and practice. So far, however, little is known about common practice, as well as the bottlenecks faced by the community in tackling the research questions posed. To shed light on the status quo of algorithm development in the specific field of biomedical image analysis, we designed an international survey that was issued to all participants of challenges conducted in conjunction with the IEEE ISBI 2021 and MICCAI 2021 conferences (80 competitions in total). The survey covered participants' expertise and working environments, their chosen strategies, as well as algorithm characteristics. A median of 72% of challenge participants took part in the survey. According to our results, knowledge exchange was the primary incentive (70%) for participation, while the reception of prize money played only a minor role (16%). While a median of 80 working hours was spent on method development, a large portion of participants stated that they did not have enough time for method development (32%). 25% perceived the infrastructure to be a bottleneck. Overall, 94% of all solutions were deep learning-based. Of these, 84% were based on standard architectures. 43% of the respondents reported that the data samples (e.g., images) were too large to be processed at once. This was most commonly addressed by patch-based training (69%), downsampling (37%), and solving 3D analysis tasks as a series of 2D tasks. K-fold cross-validation on the training set was performed by only 37% of the participants, and only 50% of the participants performed ensembling, based either on multiple identical models (61%) or on heterogeneous models (39%). 48% of the respondents applied postprocessing steps.
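The most commonly reported workaround for oversized samples, patch-based training, can be sketched as follows. This is a toy example; the shapes, seed, and function name are illustrative, not from the survey:

```python
import numpy as np

def sample_patches(volume, patch_size, n_patches, seed=0):
    """Randomly crop fixed-size patches from a sample that is too large
    to be processed at once (patch-based training)."""
    rng = np.random.default_rng(seed)
    patches = []
    for _ in range(n_patches):
        corner = [rng.integers(0, s - p + 1)
                  for s, p in zip(volume.shape, patch_size)]
        slices = tuple(slice(c, c + p) for c, p in zip(corner, patch_size))
        patches.append(volume[slices])
    return np.stack(patches)

vol = np.random.rand(64, 64, 64)                 # e.g. a 3D biomedical scan
batch = sample_patches(vol, (16, 16, 16), n_patches=8)
```

At inference time, the same idea runs in reverse: predictions on overlapping patches are stitched back into the full volume.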
Cone-beam computed tomography (CBCT) provides 3D volumetric imaging of a target with low radiation dose and cost compared with conventional computed tomography, and it is widely used in the detection of paranasal sinus disease. However, it lacks the sensitivity to detect soft-tissue lesions owing to reconstruction constraints. Consequently, only physicians with expertise in CBCT reading can distinguish between inherent artifacts or noise and diseases, restricting the use of this imaging modality. The development of artificial intelligence (AI)-based computer-aided diagnosis methods for CBCT to overcome the shortage of experienced physicians has attracted substantial attention. However, no advanced AI-based diagnosis addressing the intrinsic noise in CBCT has yet been devised, discouraging the practical use of AI solutions for CBCT. To address this issue, we propose an AI-based computer-aided diagnosis method using CBCT with a denoising module. This module is applied before diagnosis to reconstruct the internal ground-truth full-dose scan corresponding to an input CBCT image and thereby improve diagnostic performance. The external validation results for the unified diagnosis of sinus fungal ball, chronic rhinosinusitis, and normal cases show that the proposed method improves the micro-average AUC, macro-average AUC, and accuracy by 7.4, 5.6, and 9.6 percentage points (from 86.2, 87.0, and 73.4% to 93.6, 92.6, and 83.0%), respectively, compared with a baseline, while improving human diagnosis accuracy by 11 percentage points (from 71.7 to 83.0%), demonstrating technical differentiation and clinical effectiveness. This pioneering study on AI-based diagnosis using CBCT indicates that denoising can improve diagnostic performance and reader interpretability in images from the sinonasal area, thereby providing a new approach and direction for radiographic image reconstruction in the development of AI-based diagnostic solutions.
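The role of the denoising module can be caricatured with a simple mean filter standing in for the learned network (purely for illustration; the paper's module is a trained model that reconstructs the full-dose scan, and all names here are made up):

```python
import numpy as np

def mean_filter(img, k=3):
    """Stand-in denoiser: a k x k mean filter. In the paper this is a
    learned module mapping a CBCT image toward its full-dose scan."""
    pad = k // 2
    p = np.pad(img, pad, mode="edge")
    out = np.empty_like(img)
    for i in range(img.shape[0]):
        for j in range(img.shape[1]):
            out[i, j] = p[i:i + k, j:j + k].mean()
    return out

rng = np.random.default_rng(0)
clean = np.full((16, 16), 0.5)                     # synthetic "full-dose" target
noisy = clean + rng.normal(0.0, 0.2, clean.shape)  # synthetic CBCT-style noise
denoised = mean_filter(noisy)
# Denoising pulls the image closer to the clean target before it is
# handed to the downstream diagnosis network.
```

The point of the pipeline is exactly this composition: denoise first, then diagnose, so the classifier (and the human reader) sees an image with less intrinsic noise.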
Image super-resolution is a common task on mobile and IoT devices, where one often needs to upscale and enhance low-resolution images and video frames. While numerous solutions have been proposed for this problem in the past, they are usually not compatible with low-power mobile NPUs, which have many computational and memory constraints. In this Mobile AI challenge, we address this problem and ask the participants to design an efficient quantized image super-resolution solution that can demonstrate real-time performance on mobile NPUs. The participants were provided with the DIV2K dataset and trained INT8 models to do high-quality 3X image upscaling. The runtime of all models was evaluated on the Synaptics VS680 Smart Home board with a dedicated edge NPU capable of accelerating quantized neural networks. All proposed solutions are fully compatible with the above NPU, demonstrating rates of up to 60 FPS when reconstructing Full HD resolution images. A detailed description of all models developed in the challenge is provided in this paper.
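Many efficient, quantization-friendly super-resolution networks produce the final 3X upscaling with a depth-to-space (pixel-shuffle) rearrangement rather than a transposed convolution. Whether the winning entries used exactly this layer is not stated above, so the following is a generic sketch of the operation:

```python
import numpy as np

def depth_to_space(x, scale=3):
    """Rearrange an (H, W, C*scale^2) feature map into an
    (H*scale, W*scale, C) image -- a cheap, NPU-friendly upscaling step."""
    h, w, c = x.shape
    out_c = c // (scale * scale)
    x = x.reshape(h, w, scale, scale, out_c)
    x = x.transpose(0, 2, 1, 3, 4)      # interleave row and column offsets
    return x.reshape(h * scale, w * scale, out_c)

feat = np.random.rand(640, 360, 9)      # 9 = 3^2 channels per output channel
hr = depth_to_space(feat, scale=3)      # 3X upscaled, Full HD spatial size
```

Because it is a pure memory rearrangement, this layer survives INT8 quantization without accuracy loss, which is one reason it is popular for edge NPUs.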
Recent successes suggest that an image can be manipulated by a text prompt, e.g., a landscape scene on a sunny day manipulated into the same scene on a rainy day, driven by the text input "raining." These approaches often utilize a style-based image generator that leverages a multi-modal (text and image) embedding space. However, we observe that such text inputs are often bottlenecked in providing and synthesizing rich semantic cues, e.g., differentiating heavy rain from rain with thunder. To address this, we advocate leveraging an additional modality, sound, which has notable advantages for image manipulation, as it can convey more diverse semantic cues (vivid emotions or dynamic expressions of the natural world) than text. In this paper, we propose a novel approach that first extends the image-text joint embedding space with sound and then applies a direct latent optimization method to manipulate a given image based on audio input, e.g., the sound of rain. Our extensive experiments show that our sound-guided image manipulation approach produces semantically and visually more plausible manipulation results than state-of-the-art text- and sound-guided image manipulation methods, which is further confirmed by our human evaluations. Our downstream task evaluations also show that our learned joint embedding space effectively encodes sound inputs.
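The idea of direct latent optimization can be sketched in a toy setting. Everything here is an assumption for illustration: a fixed linear map `A` stands in for the frozen generator plus image encoder, and `t` stands in for the audio embedding (e.g., of the sound of rain); we descend on the embedding-space distance with respect to the latent code `z` only:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((32, 8))   # stand-in: generator composed with encoder
z = rng.standard_normal(8)         # latent code of the image to manipulate
t = rng.standard_normal(32)        # target audio embedding

loss_init = np.sum((A @ z - t) ** 2)
lr = 0.005
for _ in range(1000):
    z -= lr * 2 * A.T @ (A @ z - t)    # gradient of the squared distance
loss_final = np.sum((A @ z - t) ** 2)
```

In the real method the generator is nonlinear and the distance is taken in the learned multi-modal embedding space, but the control flow is the same: the generator stays frozen and only the latent is updated.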
Sign Language Production (SLP) aims to translate expressions in spoken language into the corresponding expressions in sign language, e.g., skeleton-based sign poses or videos. Existing SLP models are either autoregressive (AR) or non-autoregressive (NAR). However, AR-SLP models suffer from regression to the mean and error propagation during decoding. NSLP-G, a NAR-based model, resolves these issues to some extent but brings other problems: for example, it does not consider the target sign length and suffers from false decoding initiation. We propose a novel NAR-SLP model with knowledge distillation (KD) to address these problems. First, we devise a length regulator to predict the end of the generated sign pose sequence. We then adopt KD, which distills spatial-linguistic features from a pre-trained pose encoder, to alleviate false decoding initiation. Extensive experiments show that the proposed method significantly outperforms existing SLP models in both the Fréchet Gesture Distance and back-translation evaluation.
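The two ingredients of the proposed model can be caricatured as follows. These are toy functions with hypothetical names; the real length regulator and distillation operate on learned features, not thresholded scores:

```python
import numpy as np

def kd_feature_loss(student_feats, teacher_feats):
    """Feature-level knowledge distillation: regress the student's features
    toward spatial-linguistic features from a frozen pre-trained pose
    encoder (the teacher)."""
    return np.mean((student_feats - teacher_feats) ** 2)

def predict_length(end_scores, threshold=0.5):
    """Toy length regulator: the sequence ends at the first frame whose
    'end' probability exceeds the threshold."""
    ends = np.where(end_scores > threshold)[0]
    return int(ends[0]) + 1 if ends.size else len(end_scores)

scores = np.array([0.1, 0.2, 0.3, 0.8, 0.9])
n_frames = predict_length(scores)          # sequence length chosen by the model
kd = kd_feature_loss(np.ones((5, 16)), np.ones((5, 16)))
```

Predicting the end of the sequence explicitly is what lets a NAR decoder emit all frames in parallel without either truncating or padding the sign.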
Despite recent efforts to obtain accurate 3D annotations in hand and object datasets, a gap remains in 3D hand and object reconstruction. Existing works leverage contact maps to refine inaccurate hand-object pose estimations and to generate grasps for given object models. However, they require explicit 3D supervision, which is rarely available, and are therefore limited to constrained settings, e.g., where thermal cameras observe residual heat left on manipulated objects. In this paper, we propose a novel semi-supervised framework that allows us to learn contact from monocular images. Specifically, we leverage visual and geometric consistency constraints on large-scale datasets to generate pseudo-labels for semi-supervised learning, and we propose an efficient graph-based network to infer contact. Our semi-supervised learning framework achieves favorable improvements over existing supervised learning methods trained on data with "limited" annotations. Notably, compared with commonly used PointNet-based approaches, our proposed model achieves superior results with less than half the network parameters and memory access cost. We show the benefit of using contact maps that regularize hand-object interactions to produce more accurate reconstructions. We further demonstrate that training with pseudo-labels can extend contact map estimation to out-of-domain objects and generalize better across multiple datasets.
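The pseudo-labeling idea, keep a label only where independent cues agree, can be sketched as follows (the threshold, names, and scalar per-point contact values are illustrative assumptions, not the paper's actual constraints):

```python
import numpy as np

def consistency_pseudo_labels(pred_visual, pred_geometric, thresh=0.1):
    """Keep a pseudo-label only where two cues (e.g. a visual and a
    geometric contact estimate) agree within a tolerance; disagreeing
    points are left unlabeled (NaN)."""
    agree = np.abs(pred_visual - pred_geometric) < thresh
    labels = np.where(agree, (pred_visual + pred_geometric) / 2, np.nan)
    return labels, agree

vis = np.array([0.9, 0.2, 0.55, 0.0])
geo = np.array([0.85, 0.7, 0.5, 0.05])
labels, mask = consistency_pseudo_labels(vis, geo)
```

The unlabeled (disagreeing) points are simply excluded from the loss, so the network is only ever supervised where the consistency constraints are confident.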
Facial behavior analysis is a broad topic with various categories, such as facial emotion recognition or age and gender recognition; many studies focus on individual tasks, while multi-task learning approaches remain open and require more research. In this paper, we present our solution and experimental results for the Multi-Task Learning challenge of the Affective Behavior Analysis in-the-wild competition. The challenge is a combination of three tasks: action unit detection, facial expression recognition, and valence-arousal estimation. To address this challenge, we introduce a cross-group module to improve multi-task learning performance. In addition, a facial graph is applied to capture the associations among action units. As a result, we achieve an evaluation metric of 1.24 on the validation data provided by the organizers, well above the baseline result of 0.30.
Precipitation forecasting is an important scientific challenge with wide-reaching impact on society. Historically, this challenge has been tackled using numerical weather prediction (NWP) models, which are based on physics-based simulation. Recently, many works have proposed an alternative approach, replacing physics-based NWP with end-to-end deep learning (DL) models. Although these DL methods show improved performance and computational efficiency, they exhibit limitations in long-term forecasting and lack the interpretability of NWP models. In this work, we present a hybrid NWP-DL workflow to fill the gap between standalone NWP and DL approaches. Under this workflow, NWP outputs are fed into a deep model that post-processes the data to yield refined precipitation forecasts. The deep model is trained with supervision, using automatic weather station (AWS) observations as ground-truth labels. This achieves the best of both worlds and can even benefit from future improvements in NWP technology. To facilitate research in this direction, we present a novel dataset focused on the Korean Peninsula, termed KoMet (Korea Meteorological Dataset), composed of NWP forecasts and AWS observations. For NWP, we use the Global Data Assimilation and Prediction System-Korea Integrated Model (GDAPS-KIM).
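The post-processing step of the hybrid workflow can be caricatured with the simplest possible learned correction: an ordinary least-squares fit of NWP output to station observations. The data below is synthetic and the linear model is a stand-in; the paper uses a deep model trained on KoMet:

```python
import numpy as np

# Synthetic stand-ins for NWP precipitation forecasts and matching
# AWS observations (mm/h); the observations act as ground-truth labels.
rng = np.random.default_rng(0)
nwp = rng.random((200, 1)) * 10
aws = 0.8 * nwp + 1.5 + rng.normal(0.0, 0.3, (200, 1))

# Simplest post-processing model: least-squares linear correction of the
# NWP output toward the station observations.
X = np.hstack([nwp, np.ones_like(nwp)])
coef, *_ = np.linalg.lstsq(X, aws, rcond=None)
refined = X @ coef

err_raw = np.mean((nwp - aws) ** 2)
err_refined = np.mean((refined - aws) ** 2)
```

Because the correction is learned on top of NWP output rather than replacing it, any future improvement in the NWP model directly improves the input to the post-processor.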
Since the derivation of the Navier-Stokes equations, it has been possible to numerically solve real-world viscous flow problems (computational fluid dynamics, CFD). Nevertheless, despite rapid advances in central processing unit (CPU) performance, the computational cost of simulating transient flows with extremely small time/grid-scale physics remains unrealistic. In recent years, machine learning (ML) techniques have received great attention across industries, and this large wave has propagated various interests into the fluid dynamics community. Recent ML CFD studies have shown that it is unrealistic to fully suppress the growth of error as the interval between the training time and the prediction time of a data-driven method increases. The development of a practical CFD acceleration methodology that applies ML remains an open problem. Therefore, the goal of this study is to develop a realistic ML strategy based on physics-informed transfer learning and to validate the accuracy and acceleration performance of the strategy on an unsteady CFD dataset. The strategy can determine when to apply transfer learning by monitoring the residuals of the governing equations in a cross-coupled computational framework. Consequently, our hypothesis proved feasible: predicting a continuous fluid-flow time series is possible, because periodic intermediate CFD simulations not only suppress the growing residuals but also update the network parameters. Notably, the cross-coupling strategy with a grid-based network model does not compromise simulation accuracy for computational acceleration. The simulation was accelerated by a factor of 1.8 under a laminar counterflow CFD dataset condition, including the parameter update time. This feasibility study used the open-source CFD software OpenFOAM and the open-source ML software TensorFlow.
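The cross-coupled control loop, advance with the surrogate, monitor the governing-equation residual, and fall back to CFD (which also retrains the network) once the residual exceeds a tolerance, can be sketched as follows. The growth rate and tolerance are made-up numbers purely to show the control flow:

```python
def run_coupled(n_steps, growth=1, tol=4):
    """Sketch of the cross-coupled loop: the ML surrogate advances the
    flow while the governing-equation residual accumulates; when it
    exceeds the tolerance, an intermediate CFD solve is run, which both
    resets the residual and fine-tunes (transfer-learns) the network."""
    residual, cfd_calls = 0, 0
    for _ in range(n_steps):
        residual += growth          # surrogate error accumulates per step
        if residual > tol:          # residual monitor triggers coupling
            cfd_calls += 1          # intermediate CFD simulation
            residual = 0            # residual reset + parameter update
    return cfd_calls

n_cfd = run_coupled(20)             # one CFD solve every 5 surrogate steps
```

The speedup comes from the ratio of cheap surrogate steps to expensive CFD solves; the residual monitor keeps that ratio as high as accuracy allows.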