The number of international benchmarking competitions is steadily increasing in various fields of machine learning (ML) research and practice. So far, however, little is known about the common practices and the bottlenecks faced by the community in tackling the research questions posed. To shed light on the status quo of algorithm development in the specific field of biomedical image analysis, we designed an international survey that was issued to all participants of challenges conducted in conjunction with the IEEE ISBI 2021 and MICCAI 2021 conferences (80 competitions in total). The survey covered participants' expertise and working environments, their chosen strategies, as well as algorithm characteristics. A median of 72% of challenge participants took part in the survey. According to our results, knowledge exchange was the primary incentive (70%) for participation, while the reception of prize money played only a minor role (16%). While a median of 80 working hours was spent on method development, a considerable portion of participants (32%) stated that they did not have enough time for method development, and 25% perceived the infrastructure to be a bottleneck. Overall, 94% of all solutions were deep learning-based; of these, 84% were based on standard architectures. 43% of the respondents reported that the data samples (e.g., images) were too large to be processed at once. This was most commonly addressed by patch-based training (69%), downsampling (37%), and solving 3D analysis tasks as a series of 2D tasks. K-fold cross-validation on the training set was performed by only 37% of the participants, and only 50% of the participants performed ensembling, based on multiple identical models (61%) or heterogeneous models (39%). 48% of the respondents applied postprocessing steps.
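As a hedged illustration of two of the practices reported above, the sketch below shows k-fold cross-validation on the training set combined with ensembling of the resulting fold models; the estimator, data arrays, and `build_model` factory are placeholders, not artifacts of the survey.

```python
# Minimal sketch: k-fold cross-validation plus ensembling of the fold models.
# `build_model`, X, and y are placeholders, not from the survey itself.
import numpy as np
from sklearn.model_selection import KFold

def cross_validate_and_ensemble(X, y, build_model, n_splits=5, seed=0):
    """Train one model per fold and return all folds as an ensemble."""
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    models, scores = [], []
    for train_idx, val_idx in kf.split(X):
        model = build_model()                      # fresh model per fold
        model.fit(X[train_idx], y[train_idx])
        scores.append(model.score(X[val_idx], y[val_idx]))
        models.append(model)
    return models, float(np.mean(scores))

def ensemble_predict(models, X):
    """Average the fold models' predictions (ensembling of identical models)."""
    return np.mean([m.predict(X) for m in models], axis=0)
```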
Transformers are widely used in NLP tasks. However, current approaches to leveraging transformers for language understanding expose one weak spot: number understanding. In some scenarios, numbers occur frequently, especially in semi-structured data such as tables. Yet current approaches to number-rich tasks with transformer-based language models abandon or lose some of the numeracy information, e.g., by breaking numbers into sub-word tokens, which leads to many number-related errors. In this paper, we propose the LUNA framework, which improves the numerical reasoning and calculation capabilities of transformer-based language models. With the number plugins NumTok and NumBed, LUNA represents each number as a whole for model input. With number pre-training, including a regression loss and model distillation, LUNA bridges the gap between number and vocabulary embeddings. To the best of our knowledge, this is the first work that explicitly injects numeracy capability into language models using number plugins. Besides evaluating toy models on toy tasks, we evaluate LUNA on three large-scale transformer models (RoBERTa, BERT, TabBERT) over three different downstream tasks (TAT-QA, TabFact, CrediTrans), and observe that the performance of the language models is consistently improved by LUNA. The augmented models also improve the official baseline of TAT-QA (EM: 50.15 -> 59.58) and achieve SOTA performance on CrediTrans (F1 = 86.17).
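To make the idea of a number plugin concrete, the following sketch (an assumption-laden illustration, not the actual NumTok/NumBed implementation) replaces each literal number with a single placeholder token and embeds its numeric value separately, so the model is not forced to split numbers into sub-word fragments; all class and variable names are illustrative.

```python
# Illustrative sketch only: replace each literal number with one placeholder
# token ([NUM]) and keep its value for a separate numeric embedding, so the
# language model does not split "59.58" into sub-word pieces.
import re
import torch
import torch.nn as nn

NUM_RE = re.compile(r"-?\d+(?:\.\d+)?")

def extract_numbers(text, placeholder="[NUM]"):
    values = [float(m) for m in NUM_RE.findall(text)]
    return NUM_RE.sub(placeholder, text), values

class NumberEmbedding(nn.Module):
    """Map a scalar value to the model's hidden size (assumed interface)."""
    def __init__(self, hidden_size=768):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(2, hidden_size), nn.GELU(),
                                  nn.Linear(hidden_size, hidden_size))

    def forward(self, values):                      # values: (n_numbers,)
        # Simple magnitude-aware encoding: sign and log-scaled magnitude.
        feats = torch.stack([torch.sign(values),
                             torch.log1p(values.abs())], dim=-1)
        return self.proj(feats)                     # (n_numbers, hidden_size)

text, values = extract_numbers("EM improved from 50.15 to 59.58")
num_embed = NumberEmbedding()(torch.tensor(values))
```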
Existing neural rendering methods for creating human avatars typically either require dense input signals such as video or multi-view images, or leverage a prior learned from large-scale specific 3D human datasets so that reconstruction can be performed with sparse-view inputs. Most of these methods fail to achieve realistic reconstruction when only a single image is available. To enable the data-efficient creation of realistic animatable 3D humans, we propose ELICIT, a novel method for learning human-specific neural radiance fields from a single image. Inspired by the fact that humans can easily reconstruct the body geometry and infer the full-body clothing from a single image, we leverage two priors in ELICIT: a 3D geometry prior and a visual semantic prior. Specifically, ELICIT introduces the 3D body shape geometry prior from a skinned vertex-based template model (i.e., SMPL) and implements the visual clothing semantic prior with CLIP-based pre-trained models. Both priors are used to jointly guide the optimization for creating plausible content in the invisible areas. To further improve visual details, we propose a segmentation-based sampling strategy that locally refines different parts of the avatar. Comprehensive evaluations on multiple popular benchmarks, including ZJU-MoCAP, Human3.6M, and DeepFashion, show that ELICIT outperforms current state-of-the-art avatar creation methods when only a single image is available. Code will be made public for research purposes at https://elicit3d.github.io .
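Assuming the visual semantic prior is expressed as a CLIP similarity loss on rendered views, the sketch below shows the general pattern with a frozen CLIP model; the prompt, simplified preprocessing, and function names are assumptions, not ELICIT's actual implementation.

```python
# Hedged sketch of a CLIP-based semantic loss on a rendered view.
# `rendered` is assumed to be an RGB tensor in [0, 1] of shape (3, H, W)
# produced by the radiance-field renderer; the prompt is purely illustrative.
import torch
import clip  # OpenAI CLIP package

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)
model.eval()

def clip_semantic_loss(rendered, prompt="a person wearing a blue jacket"):
    """Negative cosine similarity between a rendered view and a text prompt."""
    # Resize to CLIP's expected resolution (normalization omitted for brevity).
    image = torch.nn.functional.interpolate(rendered.unsqueeze(0),
                                            size=224, mode="bilinear")
    text = clip.tokenize([prompt]).to(device)
    image_feat = model.encode_image(image.to(device))
    text_feat = model.encode_text(text)
    image_feat = image_feat / image_feat.norm(dim=-1, keepdim=True)
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)
    return 1.0 - (image_feat * text_feat).sum(dim=-1).mean()
```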
Quantifying the perceptual similarity of two images is a long-standing problem in low-level computer vision. The natural image domain commonly relies on supervised learning, e.g., a pre-trained VGG, to obtain a latent representation. However, due to domain shift, pre-trained models from the natural image domain might not apply to other image domains, such as medical imaging. Notably, in medical imaging, evaluating perceptual similarity is performed exclusively by specialists trained extensively in diverse medical fields. Thus, medical imaging remains devoid of task-specific, objective perceptual measures. This work answers the question: Is it necessary to rely on supervised learning to obtain an effective representation that can measure perceptual similarity, or is self-supervision sufficient? To understand whether recent contrastive self-supervised representations (CSR) may come to the rescue, we start with natural images, systematically evaluate CSR as a metric across numerous contemporary architectures and tasks, and compare it with existing methods. We find that in the natural image domain, CSR performs on par with its supervised counterpart as a metric on several perceptual tests, and that in the medical domain, CSR better quantifies perceptual similarity with respect to the experts' ratings. We also demonstrate that CSR can significantly improve image quality in two image synthesis tasks. Finally, our extensive results suggest that perceptuality is an emergent property of CSR, which can be adapted to many image domains without requiring annotations.
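A minimal sketch of the underlying idea, assuming any frozen, contrastively pre-trained encoder: perceptual distance is taken as a feature-space cosine distance between the two images. The backbone and feature layer below are placeholders, not the paper's exact setup.

```python
# Illustrative perceptual distance from a frozen self-supervised encoder.
# `encoder` stands in for any contrastively pre-trained backbone; the exact
# backbone and feature layer used in the paper may differ.
import torch
import torch.nn.functional as F
from torchvision.models import resnet50

@torch.no_grad()
def csr_perceptual_distance(encoder, img_a, img_b):
    """Cosine distance between encoder features of two image batches (N, C, H, W)."""
    feat_a = encoder(img_a).flatten(1)
    feat_b = encoder(img_b).flatten(1)
    return 1.0 - F.cosine_similarity(feat_a, feat_b, dim=1)

# Example with a torchvision ResNet-50 whose weights would, in practice, be
# replaced by a self-supervised (e.g., SimCLR/MoCo-style) checkpoint.
backbone = resnet50()
backbone.fc = torch.nn.Identity()   # expose the 2048-d pooled features
backbone.eval()
dist = csr_perceptual_distance(backbone, torch.rand(1, 3, 224, 224),
                               torch.rand(1, 3, 224, 224))
```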
Online forms are widely used to collect data from humans and support a multi-billion-dollar market. Many software products provide online services for creating semi-structured forms, in which questions and descriptions are organized by pre-defined structures. However, the design and creation process of forms is still tedious and requires expert knowledge. To assist form designers, in this work we present FormLM to model online forms (by enhancing a pre-trained language model with form structural information) and recommend form creation ideas (including question/option recommendation and block type suggestion). For model training and evaluation, we collect the first public online form dataset with 62K online forms. Experimental results show that FormLM significantly outperforms general-purpose language models on all tasks, with improvements of 4.71 on Question Recommendation in terms of ROUGE-1 and 10.6 on Block Type Suggestion in terms of Macro-F1.
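One plausible reading of "enhancing a pre-trained language model with form structural information" is to add a learned block-type embedding to the token embeddings before the encoder, in the spirit of segment embeddings; the sketch below illustrates that pattern with assumed names and sizes and is not the FormLM implementation.

```python
# Hedged sketch: add a block-type embedding to token embeddings.
# Block-type IDs, vocabulary size, and hidden size are assumptions.
import torch
import torch.nn as nn

class StructureAwareEmbedding(nn.Module):
    def __init__(self, vocab_size=30522, hidden_size=768, n_block_types=6):
        super().__init__()
        self.token_embed = nn.Embedding(vocab_size, hidden_size)
        self.block_type_embed = nn.Embedding(n_block_types, hidden_size)
        self.norm = nn.LayerNorm(hidden_size)

    def forward(self, token_ids, block_type_ids):
        # block_type_ids marks each token's containing block
        # (e.g., 0=title, 1=question, 2=option, 3=description).
        x = self.token_embed(token_ids) + self.block_type_embed(block_type_ids)
        return self.norm(x)

emb = StructureAwareEmbedding()
tokens = torch.randint(0, 30522, (2, 16))
blocks = torch.randint(0, 6, (2, 16))
hidden = emb(tokens, blocks)        # (2, 16, 768), fed to the encoder
```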
In this paper, we consider the problem of bit allocation in neural video compression (NVC). First, we reveal that a recent bit allocation approach claimed to be optimal is in fact sub-optimal due to its implementation. Specifically, we find that its sub-optimality lies in the improper application of semi-amortized variational inference (SAVI) to latents with a non-factorized variational posterior. Then, we show that the corrected version of SAVI on non-factorized latents requires recursively applying back-propagation through gradient ascent, based on which we derive the corrected optimal bit allocation algorithm. Due to the computational infeasibility of the corrected bit allocation, we design an efficient approximation to make it practical. Empirical results show that our proposed correction significantly improves the incorrect bit allocation in terms of R-D performance and bitrate error, and substantially outperforms all other bit allocation methods. The source code is provided in the supplementary material.
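A compact sketch of the mechanism at the heart of this discussion, semi-amortized latent refinement, under the assumption of generic `encoder`, `rate`, and `distortion` functions (all placeholders): the latent produced by the amortized encoder is further refined by gradient steps on the rate-distortion objective. This is a sketch of the general SAVI idea, not the paper's corrected algorithm.

```python
# Hedged sketch of SAVI-style latent refinement for a single frame.
# `encoder`, `rate`, and `distortion` stand in for the codec's amortized
# encoder and its R-D terms.
import torch

def refine_latent(encoder, rate, distortion, frame, lmbda=0.01,
                  steps=10, lr=1e-3):
    y = encoder(frame).detach().requires_grad_(True)   # amortized initialization
    opt = torch.optim.Adam([y], lr=lr)
    for _ in range(steps):
        loss = distortion(y, frame) + lmbda * rate(y)  # D + lambda * R
        opt.zero_grad()
        loss.backward()
        opt.step()
    return y.detach()
```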
Non-line-of-sight (NLOS) imaging is an emerging technique for detecting objects hidden behind obstacles or around corners. Recent studies on passive NLOS imaging have mainly focused on steady-state measurement and reconstruction methods, which show limitations in recognizing moving targets. To the best of our knowledge, we propose a novel event-based passive NLOS imaging method. We acquire asynchronous event-based data that contains detailed dynamic information about the NLOS target, and effectively mitigate the speckle degradation caused by motion. In addition, we create the first event-based NLOS imaging dataset, NLOS-ES, and extract event-based features via a time-surface representation. We compare reconstructions from event-based data with those from frame-based data. The event-based method performs well in terms of PSNR and LPIPS, outperforming the frame-based method by 20% and 10%, respectively, while its data volume is only 2% of that of the traditional method.
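A small sketch of the time-surface feature mentioned above, assuming events arrive as (x, y, t, polarity) tuples; the decay constant and resolution are illustrative.

```python
# Illustrative time-surface computation for an asynchronous event stream.
# Each pixel stores an exponentially decayed trace of its most recent event,
# separately per polarity.
import numpy as np

def time_surface(events, height, width, t_ref, tau=50e-3):
    """events: array of (x, y, t, p) rows; returns a (2, H, W) time surface."""
    surface = np.zeros((2, height, width), dtype=np.float32)
    last_t = np.full((2, height, width), -np.inf)
    for x, y, t, p in events:
        last_t[int(p), int(y), int(x)] = t      # keep the latest timestamp
    valid = np.isfinite(last_t)
    surface[valid] = np.exp(-(t_ref - last_t[valid]) / tau)
    return surface

events = np.array([[10, 12, 0.010, 0], [11, 12, 0.020, 1]])
ts = time_surface(events, height=64, width=64, t_ref=0.025)
```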
Neural Radiance Fields (NeRFs) have been successfully used for scene representation. Recent works have also developed robot navigation and manipulation systems on top of NeRF-based environment representations. Since object localization is the foundation of many robotic applications, and to further unleash the potential of NeRFs in robotic systems, we study object localization within a NeRF scene. We propose a transformer-based framework, NeRF-Loc, to extract 3D bounding boxes of objects in NeRF scenes. NeRF-Loc takes a pre-trained NeRF model and camera views as input and produces labeled 3D bounding boxes of objects as output. Specifically, we design a pair of parallel transformer encoder branches, namely a coarse stream and a fine stream, to encode both the context and the details of target objects. The encoded features are then fused with attention layers to alleviate ambiguities for accurate object localization. We compare our method with conventional transformer-based methods and show that it achieves better performance. In addition, we present the first NeRF-sample-based object localization benchmark, NeRFLocBench.
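The sketch below illustrates the general two-branch pattern with assumed dimensions and module names: two parallel transformer encoders process coarse and fine tokens, and cross-attention fuses them. It is not NeRF-Loc's exact architecture.

```python
# Hedged sketch of a coarse/fine dual-branch encoder fused by attention.
# Feature dimensions and the fusion choice are assumptions for illustration.
import torch
import torch.nn as nn

class CoarseFineFusion(nn.Module):
    def __init__(self, d_model=256, nhead=8, num_layers=2):
        super().__init__()
        def make_branch():
            layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
            return nn.TransformerEncoder(layer, num_layers)
        self.coarse = make_branch()
        self.fine = make_branch()
        self.fuse = nn.MultiheadAttention(d_model, nhead, batch_first=True)

    def forward(self, coarse_tokens, fine_tokens):
        c = self.coarse(coarse_tokens)          # (B, Nc, d): context
        f = self.fine(fine_tokens)              # (B, Nf, d): details
        fused, _ = self.fuse(query=c, key=f, value=f)
        return fused                            # context enriched with detail

model = CoarseFineFusion()
out = model(torch.rand(1, 32, 256), torch.rand(1, 128, 256))
```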
In various learning-based image restoration tasks, such as image denoising and image super-resolution, degradation representations are widely used to model the degradation process and handle complicated degradation patterns. However, they are less explored in learning-based image deblurring, as blur-kernel estimation does not perform well in challenging real-world cases. We argue that degradation representations are particularly necessary for image deblurring, since blur patterns typically show much larger variations than noise patterns or high-frequency textures. In this paper, we propose a framework to learn spatially adaptive degradation representations of blurry images. A novel joint image reblurring and deblurring learning process is presented to improve the expressiveness of the degradation representations. To make the learned degradation representations effective for reblurring and deblurring, we propose a Multi-Scale Degradation Injection Network (MSDI-Net) to integrate them into the neural network. With this integration, MSDI-Net can adapt to various and complicated blur patterns. Experiments on the GoPro and RealBlur datasets show that our proposed deblurring framework with the learned degradation representations outperforms state-of-the-art methods with appealing improvements. The code is released at https://github.com/dasongli1/learning_degradation.
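A minimal sketch, with assumed shapes and names rather than MSDI-Net itself, of injecting a spatially adaptive degradation representation into convolutional features via predicted per-pixel scale and shift:

```python
# Hedged sketch: modulate image features with a spatially adaptive
# degradation representation via predicted scale/shift maps.
import torch
import torch.nn as nn

class DegradationInjection(nn.Module):
    def __init__(self, feat_ch=64, degra_ch=32):
        super().__init__()
        self.to_scale = nn.Conv2d(degra_ch, feat_ch, kernel_size=3, padding=1)
        self.to_shift = nn.Conv2d(degra_ch, feat_ch, kernel_size=3, padding=1)

    def forward(self, feat, degra):
        # feat: (B, feat_ch, H, W); degra: (B, degra_ch, H, W)
        return feat * (1 + self.to_scale(degra)) + self.to_shift(degra)

inject = DegradationInjection()
out = inject(torch.rand(1, 64, 128, 128), torch.rand(1, 32, 128, 128))
```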
The reconstruction of 3D pulmonary segments plays an important role in surgical treatment planning for lung cancer, as it helps preserve pulmonary function and ensure low recurrence rates. However, automatic reconstruction of pulmonary segments remains unexplored in the era of deep learning. In this paper, we investigate what makes for the automatic reconstruction of pulmonary segments. First, we formulate the anatomical definitions of pulmonary segments both clinically and geometrically, and propose evaluation metrics adhering to these definitions. Second, we propose ImPulSe (Implicit Pulmonary Segment), a deep implicit surface model designed for pulmonary segment reconstruction. The automatic reconstructions produced by ImPulSe are accurate in metrics and visually appealing. Compared with canonical segmentation methods, ImPulSe outputs continuous predictions at arbitrary resolutions with higher training efficiency and fewer parameters. Finally, we experiment with different network inputs to analyze what matters in the task of pulmonary segment reconstruction. Our code is available at https://github.com/m3dv/impulse.
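The sketch below illustrates the general deep implicit pattern the abstract refers to: an MLP maps continuous coordinates, conditioned on an image feature, to segment logits, so predictions can be queried at arbitrary resolution. The feature extractor, class count, and sizes are assumptions, not ImPulSe's actual configuration.

```python
# Hedged sketch of a deep implicit model for segment labels: an MLP maps
# continuous coordinates (conditioned on a global image feature) to classes.
import torch
import torch.nn as nn

class ImplicitSegmentField(nn.Module):
    def __init__(self, feat_dim=256, n_classes=19, hidden=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3 + feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_classes))

    def forward(self, coords, image_feat):
        # coords: (B, N, 3) in normalized space; image_feat: (B, feat_dim)
        feat = image_feat.unsqueeze(1).expand(-1, coords.shape[1], -1)
        return self.mlp(torch.cat([coords, feat], dim=-1))  # (B, N, n_classes)

field = ImplicitSegmentField()
logits = field(torch.rand(1, 4096, 3), torch.rand(1, 256))
```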