The number of international benchmarking competitions is steadily increasing in various fields of machine learning (ML) research and practice. So far, however, little is known about common practice and the bottlenecks faced by the community in tackling the research questions posed. To shed light on the status quo of algorithm development in the specific field of biomedical imaging analysis, we designed an international survey that was issued to all participants of challenges conducted in conjunction with the IEEE ISBI 2021 and MICCAI 2021 conferences (80 competitions in total). The survey covered participants' expertise and working environments, their chosen strategies, and algorithm characteristics. A median of 72% of challenge participants took part in the survey. According to our results, knowledge exchange was the primary incentive (70%) for participation, while the reception of prize money played only a minor role (16%). While a median of 80 working hours was spent on method development, a large portion of participants (32%) stated that they did not have enough time for method development. 25% perceived the infrastructure to be a bottleneck. Overall, 94% of all solutions were deep learning-based. Of these, 84% were based on standard architectures. 43% of the respondents reported that the data samples (e.g., images) were too large to be processed at once. This was most commonly addressed by patch-based training (69%), downsampling (37%), and solving 3D analysis tasks as a series of 2D tasks. K-fold cross-validation on the training set was performed by only 37% of the participants, and only 50% of the participants performed ensembling, based either on multiple identical models (61%) or on heterogeneous models (39%). 48% of the respondents applied postprocessing steps.
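As an illustration of the K-fold cross-validation mentioned in the survey, the following is a minimal sketch of how a training set might be split into folds. The function name `k_fold_indices` and the contiguous, unshuffled split are assumptions made for this example; the survey does not describe how participants implemented their splits.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, val_indices) for each of k folds (hypothetical helper)."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so fold sizes differ by at most one.
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        val = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, val
        start += size


# Example: 10 training cases split into 3 folds.
folds = list(k_fold_indices(10, 3))
for train, val in folds:
    print(len(train), len(val))
```

Each case appears in exactly one validation fold, so every training sample contributes once to the validation estimate.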
We explore the capability of plain Vision Transformers (ViTs) for semantic segmentation and propose SegVit. Previous ViT-based segmentation networks usually learn a pixel-level representation from the output of the ViT. In contrast, we use the fundamental component of the ViT, the attention mechanism, to generate masks for semantic segmentation. Specifically, we propose the Attention-to-Mask (ATM) module, in which the similarity maps between a set of learnable class tokens and the spatial feature maps are transferred to the segmentation masks. Experiments show that our proposed SegVit using the ATM module outperforms its counterparts using the plain ViT backbone on the ADE20K dataset and achieves new state-of-the-art performance on the COCO-Stuff-10K and PASCAL-Context datasets. Furthermore, to reduce the computational cost of the ViT backbone, we propose query-based down-sampling (QD) and query-based up-sampling (QU) to build a Shrunk structure. With the proposed Shrunk structure, the model can save up to $40\%$ of the computation while maintaining competitive performance.
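The idea of turning token-to-feature similarity maps into masks can be sketched as follows. This is a simplified illustration, not the SegVit implementation: the function name `attention_to_mask`, the scaled dot product, and the sigmoid that squashes similarities into (0, 1) are assumptions chosen for the example, and the learnable projections of a real attention layer are omitted.

```python
import math


def attention_to_mask(class_tokens, features):
    """Turn token/feature similarities into soft masks (illustrative only).

    class_tokens: N vectors of length C (one per class).
    features:     H*W vectors of length C (flattened spatial feature map).
    Returns N masks, each a list of H*W values in (0, 1).
    """
    scale = 1.0 / math.sqrt(len(class_tokens[0]))  # assumed scaling, as in dot-product attention
    masks = []
    for token in class_tokens:
        mask = []
        for feat in features:
            similarity = scale * sum(t * f for t, f in zip(token, feat))
            mask.append(1.0 / (1.0 + math.exp(-similarity)))  # assumed sigmoid squashing
        masks.append(mask)
    return masks


# Two class tokens, three spatial positions, C = 2.
tokens = [[1.0, 0.0], [0.0, 1.0]]
feats = [[4.0, 0.0], [0.0, 4.0], [0.0, 0.0]]
masks = attention_to_mask(tokens, feats)
```

In this toy input, the first token responds strongly only at the first position and the second token only at the second, which is the behaviour one would want from a per-class mask.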
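Existing matching-based approaches perform video object segmentation (VOS) by retrieving support features from a pixel-level memory. However, some pixels may lack correspondences in the memory (i.e., they are unseen), which inevitably limits segmentation performance. In this paper, we present a Two-Stream Network (TSN). Our TSN includes (i) a pixel stream with a conventional pixel-level memory, which segments the seen pixels based on pixel-level memory retrieval; (ii) an instance stream for the unseen pixels, in which a holistic understanding of the instance is obtained with dynamic segmentation heads conditioned on the features of the target instance; and (iii) a pixel division module that generates a routing map, with which the output embeddings of the two streams are fused. The compact instance stream effectively improves the segmentation accuracy of the unseen pixels, while fusing the two streams with the adaptive routing map leads to an overall performance boost. Through extensive experiments, we demonstrate the effectiveness of the proposed TSN and report state-of-the-art performance of 86.1% on YouTube-VOS 2018 and 87.5% on the DAVIS-2017 validation split.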
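This paper focuses on online kernel learning over a decentralized network. Each agent in the network receives continuous streaming data locally and works collaboratively to learn a nonlinear prediction function that is globally optimal in the reproducing kernel Hilbert space with respect to the total instantaneous costs of all agents. To circumvent the curse of dimensionality in traditional online kernel learning, we use random feature (RF) mapping to convert the non-parametric kernel learning problem into a fixed-length parametric one in the RF space. We then propose a novel learning framework, named Online Decentralized Kernel learning via Linearized ADMM (ODKLA), to efficiently solve the online decentralized kernel learning problem. To further improve communication efficiency, we add quantization and censoring strategies to the communication stage and develop the Quantized and Communication-censored ODKLA (QC-ODKLA) algorithm. We prove theoretically that both ODKLA and QC-ODKLA achieve the optimal sublinear regret $\mathcal{O}(\sqrt{T})$ over $T$ time slots. Through numerical experiments, we evaluate the learning effectiveness as well as the communication and computation efficiency of the proposed methods.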
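The random feature mapping referred to above can be illustrated with random Fourier features for a Gaussian kernel, a standard construction. The sketch below is only that construction, not ODKLA itself: the kernel choice, the helper names `make_rff_map` and `gaussian_kernel`, the bandwidth, and the feature count are all assumptions for the example.

```python
import math
import random


def make_rff_map(dim, n_features, bandwidth, seed=0):
    """Build a fixed-length feature map z(x) with z(x).z(y) ~ exp(-||x-y||^2 / (2 bandwidth^2))."""
    rng = random.Random(seed)
    # Frequencies drawn from the Fourier transform of the Gaussian kernel.
    weights = [[rng.gauss(0.0, 1.0 / bandwidth) for _ in range(dim)]
               for _ in range(n_features)]
    phases = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(n_features)]
    norm = math.sqrt(2.0 / n_features)

    def feature_map(x):
        return [norm * math.cos(sum(w_i * x_i for w_i, x_i in zip(w, x)) + b)
                for w, b in zip(weights, phases)]

    return feature_map


def gaussian_kernel(x, y, bandwidth):
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


z = make_rff_map(dim=3, n_features=4000, bandwidth=1.0)
x, y = [0.1, 0.2, 0.3], [0.4, 0.0, -0.1]
approx = sum(a * b for a, b in zip(z(x), z(y)))
exact = gaussian_kernel(x, y, 1.0)
print(approx, exact)
```

Because the feature vector has a fixed length regardless of how many samples have been seen, a learner in this space only has to maintain and exchange a parameter vector of that length, which is the property the abstract relies on.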
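We present a simple yet effective fully convolutional one-stage 3D object detector for LiDAR point clouds of autonomous driving scenes, termed FCOS-LiDAR. Unlike the dominant methods that use the bird's-eye view (BEV), our proposed detector detects objects from the range view (RV, also known as the range image) of the LiDAR points. Owing to the compactness of the range view and its compatibility with the sampling process of LiDAR sensors on self-driving cars, a range-view-based object detector can be realized using only vanilla 2D convolutions, in contrast to BEV-based methods, which often involve complicated voxelization operations and sparse convolutions. We show for the first time that an RV-based 3D detector using only standard 2D convolutions can achieve performance comparable to state-of-the-art BEV-based detectors while being faster and simpler. More importantly, almost all previous range-view-based detectors focus only on single-frame point clouds, since fusing multi-frame point clouds into a single range view is challenging. In this work, we tackle this issue with a novel range view projection mechanism and demonstrate, for the first time, the benefits of fusing multi-frame point clouds for a range-view-based detector. Extensive experiments on nuScenes show the superiority of our proposed method, and we believe our work provides strong evidence that an RV-based 3D detector can compare favorably with the current mainstream BEV-based detectors.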
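For readers unfamiliar with the range view, the sketch below shows a common single-frame spherical projection of LiDAR points into a 2D range image. It does not reproduce the multi-frame projection mechanism proposed in the paper, which the abstract does not describe. The function name, image size, and vertical field of view are hypothetical values chosen for the example.

```python
import math


def project_to_range_view(points, height=32, width=1024,
                          fov_up_deg=10.0, fov_down_deg=-30.0):
    """Project (x, y, z) points to a height x width range image (0.0 = no return)."""
    image = [[0.0] * width for _ in range(height)]
    fov_deg = fov_up_deg - fov_down_deg
    for x, y, z in points:
        rng = math.sqrt(x * x + y * y + z * z)
        if rng == 0.0:
            continue
        azimuth = math.atan2(y, x)                       # in [-pi, pi]
        elevation_deg = math.degrees(math.asin(z / rng))
        if not fov_down_deg <= elevation_deg <= fov_up_deg:
            continue                                     # outside the assumed vertical FOV
        col = int(0.5 * (1.0 - azimuth / math.pi) * width)
        row = int((fov_up_deg - elevation_deg) / fov_deg * height)
        col = min(width - 1, max(0, col))
        row = min(height - 1, max(0, row))
        # Keep the nearest return when several points fall into one pixel.
        if image[row][col] == 0.0 or rng < image[row][col]:
            image[row][col] = rng
    return image


image = project_to_range_view([(10.0, 0.0, 0.0), (5.0, 0.0, 0.0), (0.0, 0.0, 50.0)])
print(image[8][512])
```

The result is an ordinary dense 2D array, which is why standard 2D convolutions can be applied to it directly.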
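We propose a direct, regression-based approach to 2D human pose estimation from single images. We formulate the problem as a sequence prediction task, which we solve using a Transformer network. This network directly learns a regression mapping from images to keypoint coordinates, without resorting to intermediate representations such as heatmaps. This approach avoids much of the complexity associated with heatmap-based approaches. To overcome the feature misalignment issues of previous regression-based methods, we propose an attention mechanism that adaptively attends to the features most relevant to the target keypoints, considerably improving accuracy. Importantly, our framework is end-to-end differentiable and naturally learns to exploit the dependencies between keypoints. Experiments on MS-COCO and MPII, two predominant pose estimation datasets, demonstrate that our method significantly improves upon the state of the art in regression-based pose estimation. More notably, ours is the first regression-based approach to perform favorably compared to the best heatmap-based pose estimation methods.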
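Federated learning learns from scattered data by fusing collaborative models from local nodes. However, the conventional coordinate-based model averaging used by FedAvg ignores the random information encoded in each parameter and may suffer from structural feature misalignment. In this work, we propose Fed2, a feature-aligned federated learning framework that resolves this issue by establishing a firm structure-feature alignment across the collaborative models. Fed2 consists of two major designs. First, we design a feature-oriented model structure adaptation method to ensure explicit feature allocation in different neural network structures. By applying the structure adaptation to the collaborative models, matchable structures with similar feature information can be initialized at a very early training stage. Second, during the federated learning process, we propose a feature-paired averaging scheme that guarantees an aligned feature distribution and avoids feature fusion conflicts under both IID and non-IID scenarios. As a result, Fed2 effectively enhances federated learning convergence under a wide range of homogeneous and heterogeneous settings, providing excellent convergence speed, accuracy, and computation/communication efficiency.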
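To make the criticized baseline concrete, here is a minimal sketch of coordinate-wise FedAvg aggregation, in which the i-th parameter of every client is averaged with the i-th parameter of all others regardless of which feature it encodes. This is the conventional scheme the abstract argues against, not Fed2's feature-paired averaging. The function name `fedavg`, the flat parameter lists, and the weighting by local sample counts are assumptions for the example.

```python
def fedavg(client_params, client_sizes):
    """Coordinate-wise weighted average of flat parameter lists (baseline FedAvg sketch)."""
    if len(client_params) != len(client_sizes) or not client_params:
        raise ValueError("need one sample count per client")
    n_params = len(client_params[0])
    if any(len(p) != n_params for p in client_params):
        raise ValueError("all clients must share the same parameter layout")
    total = sum(client_sizes)
    # Position i is averaged with position i, whatever feature it happens to encode.
    return [
        sum(params[i] * size for params, size in zip(client_params, client_sizes)) / total
        for i in range(n_params)
    ]


# Two clients holding 1 and 3 local samples respectively.
global_params = fedavg([[1.0, 2.0], [3.0, 4.0]], [1, 3])
print(global_params)
```

If two clients happen to learn the same feature at different positions, this position-by-position averaging mixes unrelated features, which is the misalignment the abstract describes.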
Very recently, a variety of vision transformer architectures for dense prediction tasks have been proposed, and they show that the design of spatial attention is critical to their success in these tasks. In this work, we revisit the design of the spatial attention and demonstrate that a carefully devised yet simple spatial attention mechanism performs favorably against the state-of-the-art schemes. As a result, we propose two vision transformer architectures, namely Twins-PCPVT and Twins-SVT. Our proposed architectures are highly efficient and easy to implement, involving only matrix multiplications that are highly optimized in modern deep learning frameworks. More importantly, the proposed architectures achieve excellent performance on a wide range of visual tasks, including image-level classification as well as dense detection and segmentation. The simplicity and strong performance suggest that our proposed architectures may serve as stronger backbones for many vision tasks. Our code is available at: https://git.io/Twins.
We propose a simple yet effective instance segmentation framework, termed CondInst (conditional convolutions for instance segmentation). Top-performing instance segmentation methods such as Mask R-CNN rely on ROI operations (typically ROIPool or ROIAlign) to obtain the final instance masks. In contrast, we propose to solve instance segmentation from a new perspective. Instead of using instance-wise ROIs as inputs to a network of fixed weights, we employ dynamic instance-aware networks, conditioned on instances. CondInst enjoys two advantages: 1) Instance segmentation is solved by a fully convolutional network, eliminating the need for ROI cropping and feature alignment. 2) Due to the much improved capacity of dynamically generated conditional convolutions, the mask head can be very compact (e.g., 3 conv. layers, each having only 8 channels), leading to significantly faster inference. We demonstrate a simpler instance segmentation method that can achieve improved performance in both accuracy and inference speed. On the COCO dataset, we outperform a few recent methods, including well-tuned Mask R-CNN baselines, without needing longer training schedules.
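The notion of a compact mask head whose weights are generated per instance can be sketched as below. This is a toy illustration under stated assumptions, not the CondInst implementation: the layers are taken to be 1x1 convolutions, the last layer is assumed to output a single mask channel, ReLU and sigmoid are assumed as activations, and the names `head_param_count` and `dynamic_mask_head` are hypothetical. In the actual method the parameter vector would be predicted by the network for each instance; here it is simply passed in.

```python
import math


def head_param_count(in_ch, hidden, n_layers=3):
    """Weights plus biases of n_layers 1x1 convs ending in one output channel."""
    widths = [in_ch] + [hidden] * (n_layers - 1) + [1]
    return sum(n_in * n_out + n_out for n_in, n_out in zip(widths, widths[1:]))


def dynamic_mask_head(pixel_features, theta, in_ch, hidden, n_layers=3):
    """Apply a per-instance head, parameterized by the flat vector theta, to every pixel."""
    if len(theta) != head_param_count(in_ch, hidden, n_layers):
        raise ValueError("theta has the wrong length for this head")
    widths = [in_ch] + [hidden] * (n_layers - 1) + [1]
    outputs = []
    for feat in pixel_features:
        x, pos = list(feat), 0
        for layer, (n_in, n_out) in enumerate(zip(widths, widths[1:])):
            w = theta[pos:pos + n_in * n_out]
            pos += n_in * n_out
            b = theta[pos:pos + n_out]
            pos += n_out
            x = [sum(w[o * n_in + i] * x[i] for i in range(n_in)) + b[o]
                 for o in range(n_out)]
            if layer < n_layers - 1:
                x = [max(0.0, v) for v in x]           # ReLU between layers (assumed)
        outputs.append(1.0 / (1.0 + math.exp(-x[0])))  # per-pixel mask probability
    return outputs


n_params = head_param_count(in_ch=8, hidden=8)
probs = dynamic_mask_head([[1.0] * 8, [0.0] * 8], [0.0] * n_params, in_ch=8, hidden=8)
print(n_params, probs)
```

With 8 channels per layer the whole head needs only a few hundred parameters, which gives a sense of why generating a separate head per instance is feasible.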
We propose a fully convolutional one-stage object detector (FCOS) to solve object detection in a per-pixel prediction fashion, analogous to semantic segmentation. Almost all state-of-the-art object detectors such as RetinaNet, SSD, YOLOv3, and Faster R-CNN rely on pre-defined anchor boxes. In contrast, our proposed detector FCOS is anchor-box free, as well as proposal free. By eliminating the predefined set of anchor boxes, FCOS completely avoids the complicated computation related to anchor boxes, such as calculating overlaps during training. More importantly, we also avoid all hyper-parameters related to anchor boxes, to which the final detection performance is often very sensitive. With non-maximum suppression (NMS) as the only post-processing step, FCOS with ResNeXt-64x4d-101 achieves 44.7% in AP with single-model and single-scale testing, surpassing previous one-stage detectors with the advantage of being much simpler. For the first time, we demonstrate a much simpler and more flexible detection framework achieving improved detection accuracy. We hope that the proposed FCOS framework can serve as a simple and strong alternative for many other instance-level tasks. Code is available at: tinyurl.com/FCOSv1
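The non-maximum suppression step named above as the only post-processing can be sketched with the standard greedy algorithm. This is a generic illustration rather than code from the FCOS repository: the `(x1, y1, x2, y2)` box format, the function names, and the 0.5 IoU threshold are assumptions for the example.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: keep boxes in descending score order unless they overlap a kept box."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


boxes = [(0.0, 0.0, 10.0, 10.0), (1.0, 1.0, 11.0, 11.0), (20.0, 20.0, 30.0, 30.0)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)
print(kept)
```

The second box overlaps the first with an IoU of about 0.68 and is suppressed, while the third box does not overlap either and is kept.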