Segmentation-based methods have recently become popular in scene text detection. They mainly consist of two steps: text kernel segmentation and expansion. However, the segmentation process considers each pixel independently, and the expansion process struggles to achieve a favorable accuracy-speed trade-off. In this paper, we propose a Context-aware and Boundary-guided Network (CBN) to tackle these problems. In CBN, a basic text detector is first used to predict initial segmentation results. Then, we propose a context-aware module that enhances text kernel feature representations by considering both global and local contexts. Finally, we introduce a boundary-guided module that expands the enhanced text kernels adaptively using only the pixels on the contours, which not only yields accurate text boundaries but also maintains high speed, especially on high-resolution output maps. In particular, with a lightweight backbone, the basic detector equipped with our proposed CBN achieves state-of-the-art results on several popular benchmarks, and CBN can be plugged into several segmentation-based methods. Code will be available at https://github.com/XiiZhao/cbn.pytorch.
Pseudo supervision is regarded as the core idea in semi-supervised learning for semantic segmentation, and there is always a trade-off between using only the high-quality pseudo labels and leveraging all the pseudo labels. To address this, we propose a novel learning approach, called Conservative-Progressive Collaborative Learning (CPCL), in which two predictive networks are trained in parallel and the pseudo supervision is based on both the agreement and the disagreement of the two predictions. One network seeks common ground via intersection supervision and is supervised by the high-quality labels to ensure more reliable supervision, while the other network preserves differences via union supervision and is supervised by all the pseudo labels to keep exploring with curiosity. In this way, conservative evolution and progressive exploration collaborate. To reduce the influence of suspicious pseudo labels, the loss is dynamically re-weighted according to the prediction confidence. Extensive experiments demonstrate that CPCL achieves state-of-the-art performance for semi-supervised semantic segmentation.
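The agreement/disagreement idea in the CPCL abstract can be sketched in a few lines. This is only an illustration under stated assumptions, not the authors' implementation: the function names, the `ignore` label, the rule of taking the more confident prediction where the two networks disagree, and the use of the winning confidence as the loss weight are all hypothetical choices made here.

```python
def argmax_conf(probs):
    """Return (label, confidence) for one pixel's class-probability list."""
    label = max(range(len(probs)), key=lambda c: probs[c])
    return label, probs[label]


def pseudo_labels(probs_a, probs_b, ignore=-1):
    """Build intersection and union pseudo labels from two networks' outputs.

    probs_a, probs_b: one class-probability list per pixel, from each network.
    Returns (intersection, union, weights), where `weights` holds the
    confidence that could re-weight the loss on each union label.
    """
    intersection, union, weights = [], [], []
    for pa, pb in zip(probs_a, probs_b):
        (la, ca), (lb, cb) = argmax_conf(pa), argmax_conf(pb)
        if la == lb:
            # Agreement: the label is kept in both label sets.
            intersection.append(la)
            union.append(la)
            weights.append(max(ca, cb))
        else:
            # Disagreement: the conservative branch ignores the pixel.
            intersection.append(ignore)
            # Assumption: the progressive branch keeps the more confident label.
            if ca >= cb:
                union.append(la)
                weights.append(ca)
            else:
                union.append(lb)
                weights.append(cb)
    return intersection, union, weights


if __name__ == "__main__":
    a = [[0.9, 0.1], [0.6, 0.4], [0.2, 0.8]]
    b = [[0.8, 0.2], [0.3, 0.7], [0.1, 0.9]]
    print(pseudo_labels(a, b))
```

In this sketch the conservative branch trains only on pixels where both networks agree, while the progressive branch trains on every pixel with a confidence-dependent weight.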
In this paper, we study the problem of visual grounding by considering both phrase extraction and grounding (PEG). In contrast to the previous phrase-known-at-test setting, PEG requires a model to extract phrases from text and locate objects in images simultaneously, which is a more practical setting in real applications. As phrase extraction can be regarded as a $1$D text segmentation problem, we formulate PEG as a dual detection problem and propose a novel DQ-DETR model, which introduces dual queries to probe different features from image and text for object prediction and phrase mask prediction. Each pair of dual queries is designed to have shared positional parts but different content parts. Such a design effectively alleviates the difficulty of modality alignment between image and text (in contrast to a single-query design) and enables the Transformer decoder to leverage phrase mask-guided attention to improve performance. To evaluate the performance of PEG, we also propose a new metric, CMAP (cross-modal average precision), analogous to the AP metric in object detection. The new metric overcomes the ambiguity of Recall@1 in many-box-to-one-phrase cases in phrase grounding. As a result, our PEG pre-trained DQ-DETR establishes new state-of-the-art results on all visual grounding benchmarks with a ResNet-101 backbone. For example, it achieves recall rates of $91.04\%$ and $83.51\%$ on RefCOCO testA and testB, respectively. Code will be available at \url{https://github.com/IDEA-Research/DQ-DETR}.
Salient object detection (SOD) focuses on distinguishing the most conspicuous objects in a scene. However, most related works are based on RGB images alone, which miss a great deal of useful information. Accordingly, as thermal imaging technology matures, RGB-T (RGB-Thermal) multi-modality tasks are attracting more and more attention. Thermal infrared images carry important information that can be used to improve the accuracy of SOD prediction. To exploit it, methods that integrate multi-modal information and suppress noise are critical. In this paper, we propose a novel network called Interactive Context-Aware Network (ICANet). It contains three modules that effectively perform cross-modal and cross-scale fusion. We design a Hybrid Feature Fusion (HFF) module to integrate the features of the two modalities, which utilizes two types of feature extraction. The Multi-Scale Attention Reinforcement (MSAR) and Upper Fusion (UF) blocks are responsible for the cross-scale fusion, which merges features from different levels and generates the prediction maps. We also propose a novel Context-Aware Multi-Supervised Network (CAMSNet) to calculate the content loss between the prediction and the ground truth (GT). Experiments show that our network performs favorably against state-of-the-art RGB-T SOD methods.
Automatically detecting lung infections from computed tomography (CT) data plays an important role in combating COVID-19. However, several challenges remain in developing such AI systems. 1) Most current COVID-19 infection segmentation methods rely mainly on 2D CT images, which lack 3D sequential constraints. 2) Existing 3D CT segmentation methods focus on single-scale representations, which do not capture multiple receptive field sizes on the 3D volume. 3) The sudden outbreak of COVID-19 makes it hard to annotate sufficient CT volumes for training deep models. To address these issues, we first build a multiple dimensional-attention convolutional neural network (MDA-CNN) to aggregate multi-scale information along different dimensions of the input feature maps and impose supervision on multiple predictions from different CNN layers. Second, we use this MDA-CNN as the basic network of a novel dual multi-scale mean teacher network (DM${^2}$T-Net) for semi-supervised COVID-19 lung infection segmentation on CT volumes, leveraging unlabeled data and exploring multi-scale information. Our DM${^2}$T-Net encourages the multiple predictions at different CNN layers from the student and teacher networks to be consistent, yielding a multi-scale consistency loss on unlabeled data, which is then added to the supervised loss on the labeled data from the multiple predictions of MDA-CNN. Third, we collect two COVID-19 segmentation datasets to evaluate our method. The experimental results show that our network consistently outperforms the compared state-of-the-art methods.
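The mean teacher scheme with a multi-scale consistency term can be illustrated roughly as follows. This is a minimal sketch rather than the DM${^2}$T-Net implementation: it assumes the usual exponential-moving-average (EMA) teacher update of mean teacher training and a mean-squared-error consistency term, neither of which is specified in the abstract, and the function names and the `alpha` default are hypothetical.

```python
def ema_update(teacher, student, alpha=0.99):
    """Move each teacher weight toward the student weight (EMA update)."""
    return [alpha * t + (1.0 - alpha) * s for t, s in zip(teacher, student)]


def multiscale_consistency(student_preds, teacher_preds):
    """Average, over scales, of the mean squared student-teacher difference.

    Each argument is a list with one flattened prediction map per scale.
    Assumption: MSE is used as the per-scale consistency measure.
    """
    per_scale = []
    for s_map, t_map in zip(student_preds, teacher_preds):
        squared = [(s - t) ** 2 for s, t in zip(s_map, t_map)]
        per_scale.append(sum(squared) / len(squared))
    return sum(per_scale) / len(per_scale)


if __name__ == "__main__":
    teacher = ema_update([1.0, 0.0], [0.0, 1.0], alpha=0.5)
    loss = multiscale_consistency([[1.0, 0.0], [0.5]], [[0.0, 0.0], [0.5]])
    print(teacher, loss)
```

On unlabeled volumes only the consistency term applies; on labeled volumes it would be added to the supervised loss, as the abstract describes.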
Image super-resolution is a common task on mobile and IoT devices, where one often needs to upscale and enhance low-resolution images and video frames. While numerous solutions have been proposed for this problem, they are usually not compatible with low-power mobile NPUs, which have many computational and memory constraints. In this Mobile AI challenge, we address this problem and challenge the participants to design an efficient quantized image super-resolution solution that achieves real-time performance on mobile NPUs. The participants were provided with the DIV2K dataset and trained INT8 models to perform high-quality 3X image upscaling. The runtime of all models was evaluated on the Synaptics VS680 Smart Home board with a dedicated edge NPU capable of accelerating quantized neural networks. All proposed solutions are fully compatible with the above NPU, demonstrating a rate of up to 60 FPS when reconstructing Full HD resolution images. A detailed description of all models developed in the challenge is provided in this paper.
Artificial Intelligence (AI) is having a tremendous impact across most areas of science. Applications of AI in healthcare have the potential to improve our ability to detect, diagnose, prognose, and intervene on human disease. For AI models to be used clinically, they need to be made safe, reproducible and robust, and the underlying software framework must be aware of the particularities (e.g. geometry, physiology, physics) of the medical data being processed. This work introduces MONAI, a freely available, community-supported, and consortium-led PyTorch-based framework for deep learning in healthcare. MONAI extends PyTorch to support medical data, with a particular focus on imaging, and provides purpose-specific AI model architectures, transformations and utilities that streamline the development and deployment of medical AI models. MONAI follows best practices for software development, providing an easy-to-use, robust, well-documented, and well-tested software framework. MONAI preserves the simple, additive, and compositional approach of its underlying PyTorch libraries. MONAI is being used by and receiving contributions from research, clinical and industrial teams from around the world, who are pursuing applications spanning nearly every aspect of healthcare.
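Optical computing is an emerging technology for next-generation efficient artificial intelligence (AI) due to its ultra-high speed and efficiency. Electromagnetic field simulation is critical to the design, optimization, and validation of photonic devices and circuits. However, costly numerical simulation significantly hinders the scalability and turn-around time of the photonic circuit design loop. Recently, physics-informed neural networks have been proposed to predict the optical field solution of a single instance of a partial differential equation (PDE) with predefined parameters. Their complicated PDE formulation and lack of efficient parametrization mechanisms limit their flexibility and generalization in practical simulation scenarios. In this work, for the first time, a physics-agnostic neural operator-based framework, dubbed NeurOLight, is proposed to learn a family of frequency-domain Maxwell PDEs for ultra-fast parametric photonic device simulation. We balance the efficiency and generalization of NeurOLight via several novel techniques. Specifically, we discretize different devices into a unified domain, represent parametric PDEs with a compact wave prior, and encode the incident light via masked source modeling. We design our model with parameter-efficient cross-shaped NeurOLight blocks and adopt superposition-based augmentation for data-efficient learning. With these synergistic approaches, NeurOLight generalizes to a large space of unseen simulation settings, demonstrates simulation that is 2 orders of magnitude faster than numerical solvers, and outperforms prior neural network models with 54% lower prediction error and about 44% fewer parameters. Our code is available at https://github.com/jeremiemelo/neurolight.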
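Superposition-based augmentation presumably relies on the linearity of the frequency-domain Maxwell equations in the source term: if two sources produce two fields, a linear combination of the sources produces the same combination of the fields. The sketch below illustrates that reading only; the function name, the flat-list data layout, and the choice of coefficients are assumptions made for illustration, not details taken from the abstract.

```python
def superpose(samples, coeffs):
    """Combine simulated (source, field) pairs into one new training pair.

    samples: list of (source, field) pairs, each a flat list of complex values
             sampled on the same grid.
    coeffs:  one complex coefficient per sample.
    Linearity in the source means the combined field matches the combined source.
    """
    n_src = len(samples[0][0])
    n_fld = len(samples[0][1])
    source = [sum(c * s[0][i] for c, s in zip(coeffs, samples)) for i in range(n_src)]
    field = [sum(c * s[1][i] for c, s in zip(coeffs, samples)) for i in range(n_fld)]
    return source, field


if __name__ == "__main__":
    # Two toy "simulations" on a two-point grid (values are made up).
    sim_a = ([1 + 0j, 0j], [2 + 0j, 1j])
    sim_b = ([0j, 1 + 0j], [1j, 3 + 0j])
    print(superpose([sim_a, sim_b], [2, 1j]))
```

Under this assumption, a handful of single-source simulations can be mixed with random complex coefficients to produce many multi-source training examples without additional solver runs.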
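Graph convolutional neural networks (GCNs) have drawn increasing attention and achieved good performance in various computer vision tasks; however, a clear interpretation of the inner mechanism of GCNs is still lacking. For standard convolutional neural networks (CNNs), class activation mapping (CAM) methods are commonly used to visualize the connection between a CNN's decision and image regions by generating a heatmap. Nonetheless, such heatmaps usually exhibit semantic chaos when these CAMs are applied to GCNs directly. In this paper, we propose a novel visualization method particularly applicable to GCNs, Vertex Semantic Class Activation Mapping (VS-CAM). VS-CAM includes two independent pipelines that produce a set of semantic-probe maps and a semantic-base map, respectively. The semantic-probe maps are used to detect semantic information in the semantic-base map and aggregate it into a semantic-aware heatmap. Qualitative results show that VS-CAM obtains heatmaps whose highlighted regions match the objects much more precisely than those of CNN-based CAM. The quantitative evaluation further demonstrates the superiority of VS-CAM.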
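For context, the standard CNN class activation map that the abstract contrasts with VS-CAM is a weighted sum of the last convolutional feature maps, using the classifier weights of the target class. The sketch below shows only that baseline CAM on nested lists; it does not implement VS-CAM, and the function names and the min-max normalization step are illustrative assumptions.

```python
def class_activation_map(feature_maps, class_weights):
    """Weighted sum of K feature maps (each H x W) for one target class."""
    height, width = len(feature_maps[0]), len(feature_maps[0][0])
    cam = [[0.0] * width for _ in range(height)]
    for fmap, weight in zip(feature_maps, class_weights):
        for y in range(height):
            for x in range(width):
                cam[y][x] += weight * fmap[y][x]
    return cam


def normalize(cam):
    """Rescale a map to [0, 1] so it can be rendered as a heatmap."""
    flat = [v for row in cam for v in row]
    low, high = min(flat), max(flat)
    span = high - low
    if span == 0:
        return [[0.0 for _ in row] for row in cam]
    return [[(v - low) / span for v in row] for row in cam]


if __name__ == "__main__":
    maps = [[[1.0, 0.0], [0.0, 0.0]], [[0.0, 0.0], [0.0, 2.0]]]
    print(normalize(class_activation_map(maps, [2.0, 0.5])))
```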
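With the increasing demand for computational photography and imaging on mobile platforms, developing and integrating advanced image sensors with novel algorithms in camera systems has become prevalent. However, the lack of high-quality data for research and the rare opportunity for an in-depth exchange of views between industry and academia constrain the development of mobile intelligent photography and imaging (MIPI). To bridge the gap, we introduce the first MIPI challenge, which includes five tracks focusing on novel image sensors and imaging algorithms. This paper introduces RGBW Joint Remosaic and Denoise, one of the five tracks, which addresses the interpolation of the RGBW CFA to Bayer at full resolution. The participants were provided with a new dataset including 70 (training) and 15 (validation) scenes of high-quality RGBW and Bayer pairs. In addition, for each scene, RGBW data at different noise levels were provided at 0dB, 24dB, and 42dB. All the data were captured using an RGBW sensor in both outdoor and indoor conditions. The final results are evaluated using objective metrics, including PSNR, SSIM, LPIPS, and KLD. A detailed description of all models developed in this challenge is provided in this paper. More details of this challenge and the link to the dataset can be found at https://github.com/mipi-challenge/mipi2022.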
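Of the metrics listed, PSNR has a simple closed form: ten times the base-10 logarithm of the squared peak value divided by the mean squared error. The sketch below computes it for flat pixel lists. The function name and the default peak of 255 are assumptions, and the challenge's exact evaluation protocol (bit depth, colour handling) is not given in the abstract.

```python
import math


def psnr(reference, test, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized pixel lists."""
    if not reference or len(reference) != len(test):
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((r - t) ** 2 for r, t in zip(reference, test)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


if __name__ == "__main__":
    print(psnr([10.0, 20.0], [11.0, 21.0]))
```

A higher value means the reconstruction is closer to the reference; identical inputs give infinity under this definition.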