Semantic segmentation usually benefits from global context, fine localisation information, multi-scale features, etc. To advance Transformer-based segmenters in these respects, we present a simple yet powerful semantic segmentation architecture, termed IncepFormer. IncepFormer makes two key contributions. First, it introduces a novel pyramid-structured Transformer encoder that harvests global context and fine localisation features simultaneously. These features are concatenated and fed into a convolution layer for the final per-pixel prediction. Second, IncepFormer integrates an Inception-like architecture with depth-wise convolutions, and a light-weight feed-forward module, into each self-attention layer, efficiently obtaining rich local multi-scale object features. Extensive experiments on five benchmarks show that IncepFormer is superior to state-of-the-art methods in both accuracy and speed. For example, 1) IncepFormer-S achieves 47.7% mIoU on ADE20K, outperforming the existing best method by 1% while using only half the parameters and fewer FLOPs; 2) IncepFormer-B achieves 82.0% mIoU on the Cityscapes dataset with 39.6M parameters. Code is available at github.com/shendu0321/IncepFormer.
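For readers unfamiliar with the metric quoted in this abstract, mIoU is the per-class intersection-over-union averaged over classes. The following is a minimal sketch in plain Python, not the benchmarks' official evaluation code. It assumes flat lists of integer class labels and skips classes that appear in neither the prediction nor the ground truth, which is a common convention but an assumption here. The name `mean_iou` is illustrative.

```python
def mean_iou(pred, target, num_classes):
    """Mean intersection-over-union over classes present in pred or target."""
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    ious = []
    for c in range(num_classes):
        # Pixels where both prediction and ground truth are class c.
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        # Pixels where either prediction or ground truth is class c.
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:  # assumption: ignore classes absent from both
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


# Toy example with 6 "pixels" and 3 classes (values are made up).
pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 0]
print(mean_iou(pred, target, num_classes=3))
```

In this toy case the per-class IoUs are 1/3, 2/3 and 1/2, so the mean is approximately 0.5.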
Translated by Google Translate
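This paper presents our methodology for the track of the Affective Behavior Analysis in-the-wild (ABAW) Competition 2022 on classifying six basic expressions, in which models learn from artificially generated data and generalize to real data. Given the ambiguity of the synthetic data and the objectivity of facial Action Units (AUs), we resort to AU information to improve performance and make the following contributions. First, to adapt the model to synthetic scenarios, we use knowledge from models pre-trained on large-scale face recognition data. Second, we propose a conceptually new framework, termed AU-Supervised Convolutional Vision Transformer (AU-CVT), which clearly improves FER performance by jointly training on auxiliary datasets with AU or pseudo-AU labels. Our AU-CVT achieves an F1 score of 0.6863 and an accuracy of 0.7433 on the validation set. The source code of our work is publicly available online: https://github.com/msy1412/abaw4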
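Automatic smoky vehicle detection in videos is an alternative to the traditional, expensive remote-sensing approach with ultraviolet light devices used by environmental protection agencies. However, it is challenging to distinguish vehicle smoke from the shadows and wet regions around the rear of a vehicle or on cluttered roads, and limited annotated data makes the problem worse. In this paper, we first introduce a real-world, large-scale smoky vehicle dataset with 75,000 annotated smoky vehicle images, which facilitates effective training of advanced deep learning models. To enable fair algorithm comparison, we also build a smoky vehicle video dataset comprising 163 long videos with segment-level annotations. Moreover, we present a new Coarse-to-fine Deep Smoky vehicle detection (CoDeS) framework for efficient smoky vehicle detection. CoDeS first uses a light-weight YOLO detector for fast smoke detection with a high recall rate, then applies a smoke-vehicle matching strategy to eliminate non-vehicle smoke, and finally uses an elaborately designed 3D model to further refine the results in the spatio-temporal space. Extensive experiments on four metrics demonstrate that our framework is superior to both hand-crafted-feature-based methods and recent advanced methods. The code and dataset will be released at https://github.com/pengxj/smokyvehicle.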
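Recently, 3D point cloud learning has been a hot topic in computer vision and autonomous driving. Because it is difficult to manually annotate a high-quality, large-scale 3D point cloud dataset, unsupervised domain adaptation (UDA), which aims to transfer learned knowledge from a labeled source domain to an unlabeled target domain, has become popular in 3D point cloud learning. However, the generalization and reconstruction errors caused by domain shift with a simple learning model are inevitable, and they substantially hinder the model's ability to learn good representations. To address these issues, we propose an end-to-end self-ensembling network (SEN) for 3D point cloud domain adaptation tasks. In general, SEN draws on the advantages of Mean Teacher and semi-supervised learning, and introduces a soft classification loss and a consistency loss, aiming to achieve consistent generalization and accurate reconstruction. In SEN, a student network is trained collaboratively with supervised and self-supervised learning, while a teacher network enforces temporal consistency, in order to learn useful representations and ensure the quality of point cloud reconstruction. Extensive experiments on several 3D point cloud UDA benchmarks show that SEN outperforms state-of-the-art methods on both classification and segmentation tasks. Further analysis shows that SEN also achieves better reconstruction results.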
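The Mean Teacher idea that this abstract builds on can be illustrated with two small functions: the teacher's weights track an exponential moving average (EMA) of the student's weights, and a consistency loss penalises disagreement between the two networks' predictions. This is a sketch of the generic Mean Teacher recipe in plain Python, not SEN's actual implementation. The flat weight lists, the `decay` value, the use of mean squared error, and the function names are all assumptions made for illustration.

```python
def ema_update(teacher, student, decay=0.99):
    """Return new teacher weights: decay * teacher + (1 - decay) * student."""
    if len(teacher) != len(student):
        raise ValueError("teacher and student must have the same length")
    return [decay * t + (1.0 - decay) * s for t, s in zip(teacher, student)]


def consistency_loss(student_probs, teacher_probs):
    """Mean squared error between student and teacher predictions."""
    if len(student_probs) != len(teacher_probs):
        raise ValueError("prediction vectors must have the same length")
    n = len(student_probs)
    return sum((s - t) ** 2 for s, t in zip(student_probs, teacher_probs)) / n


# Toy weights and predictions (values are made up).
teacher_w = [0.0, 1.0]
student_w = [1.0, 0.0]
teacher_w = ema_update(teacher_w, student_w, decay=0.9)  # moves 10% toward student
loss = consistency_loss([0.7, 0.3], [0.5, 0.5])
print(teacher_w, loss)
```

In a real training loop the student would be updated by gradient descent on the supervised loss plus this consistency term, and `ema_update` would be called after every student step. The teacher itself receives no gradients.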
In recent years, arbitrary image style transfer has attracted increasing attention. Given a pair of content and style images, the goal is a stylized image that retains the content of the former while capturing the style patterns of the latter. However, it is difficult to balance content details against style features: when an image is stylized with sufficient style patterns, its content details may be damaged, and sometimes the objects in the image cannot be clearly distinguished. For this reason, we present a new transformer-based method for image style transfer, named STT, together with an edge loss that noticeably enhances content details and avoids the blurred results caused by excessive rendering of style features. Qualitative and quantitative experiments demonstrate that STT achieves performance comparable to state-of-the-art image style transfer methods while alleviating the content leak problem.
In recent years, the Transformer architecture has shown its superiority in the video-based person re-identification task. Inspired by video representation learning, these methods mainly focus on designing modules to extract informative spatial and temporal features. However, they are still limited in extracting local attributes and global identity information, which are critical for person re-identification. In this paper, we propose a novel Multi-Stage Spatial-Temporal Aggregation Transformer (MSTAT) with two newly designed proxy embedding modules to address this issue. Specifically, MSTAT consists of three stages that encode the attribute-associated, the identity-associated, and the attribute-identity-associated information from the video clips, respectively, achieving a holistic perception of the input person. We combine the outputs of all stages for the final identification. In practice, to reduce computational cost, Spatial-Temporal Aggregation (STA) modules are first adopted in each stage to conduct self-attention along the spatial and temporal dimensions separately. We further introduce the Attribute-Aware and Identity-Aware Proxy embedding modules (AAP and IAP) to extract informative and discriminative feature representations at different stages. All of them are realized with newly designed self-attention operations with specific meanings. Moreover, temporal patch shuffling is introduced to further improve the robustness of the model. Extensive experimental results demonstrate the effectiveness of the proposed modules in extracting informative and discriminative information from videos, and show that MSTAT achieves state-of-the-art accuracy on various standard benchmarks.
Machine learning models are typically evaluated by computing their similarity with reference annotations and trained by maximizing that similarity. Especially in the bio-medical domain, annotations are subjective and suffer from low inter- and intra-rater reliability. Since annotations only reflect the annotating entity's interpretation of the real world, this can lead to sub-optimal predictions even though the model achieves high similarity scores. Here, the theoretical concept of Peak Ground Truth (PGT) is introduced. PGT marks the point beyond which an increase in similarity with the reference annotation stops translating to better Real World Model Performance (RWMP). Additionally, a quantitative technique to approximate PGT by computing inter- and intra-rater reliability is proposed. Finally, three categories of PGT-aware strategies to evaluate and improve model performance are reviewed.
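One way to make the idea of approximating PGT from rater agreement concrete is to measure how closely several annotators agree with each other on the same case. The sketch below computes the mean pairwise Dice score between binary masks from different raters. The abstract does not specify the similarity measure or the aggregation, so the choice of Dice, the pairwise averaging, and the function names are assumptions made for illustration, not the paper's procedure.

```python
from itertools import combinations


def dice(a, b):
    """Dice similarity between two binary masks given as flat 0/1 lists."""
    if len(a) != len(b):
        raise ValueError("masks must have the same length")
    inter = sum(1 for x, y in zip(a, b) if x and y)
    total = sum(1 for x in a if x) + sum(1 for y in b if y)
    # Two empty masks agree perfectly by convention (an assumption).
    return 2.0 * inter / total if total else 1.0


def mean_pairwise_dice(masks):
    """Average Dice over all unordered pairs of rater masks."""
    pairs = list(combinations(masks, 2))
    if not pairs:
        raise ValueError("need at least two masks")
    return sum(dice(a, b) for a, b in pairs) / len(pairs)


# Three hypothetical raters annotating the same 4-pixel case.
raters = [
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [1, 1, 1, 0],
]
print(mean_pairwise_dice(raters))
```

Under this reading, a model whose Dice against one rater's reference exceeds the agreement between raters may be fitting that rater's idiosyncrasies rather than improving real-world performance.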
We propose a novel approach to self-supervised learning of point cloud representations by differentiable neural rendering. Motivated by the fact that informative point cloud features should be able to encode rich geometry and appearance cues and render realistic images, we train a point-cloud encoder within a purpose-built point-based neural renderer by comparing the rendered images with real images on massive RGB-D data. The learned point-cloud encoder can be easily integrated into various downstream tasks, including not only high-level tasks like 3D detection and segmentation, but also low-level tasks like 3D reconstruction and image synthesis. Extensive experiments on various tasks demonstrate the superiority of our approach compared to existing pre-training methods.
Collaboration among industrial Internet of Things (IoT) devices and edge networks is essential to support computation-intensive deep neural network (DNN) inference services, which require low delay and high accuracy. Sampling rate adaption, which dynamically configures the sampling rates of industrial IoT devices according to network conditions, is key to minimizing the service delay. In this paper, we investigate the collaborative DNN inference problem in industrial IoT networks. To capture the channel variation and task arrival randomness, we formulate the problem as a constrained Markov decision process (CMDP). Specifically, sampling rate adaption, inference task offloading and edge computing resource allocation are jointly considered to minimize the average service delay while guaranteeing the long-term accuracy requirements of different inference services. Since a CMDP cannot be directly solved by general reinforcement learning (RL) algorithms due to the intractable long-term constraints, we first transform the CMDP into an MDP by leveraging the Lyapunov optimization technique. Then, a deep RL-based algorithm is proposed to solve the MDP. To expedite the training process, an optimization subroutine is embedded in the proposed algorithm to directly obtain the optimal edge computing resource allocation. Extensive simulation results demonstrate that the proposed RL-based algorithm can significantly reduce the average service delay while preserving long-term inference accuracy with high probability.
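The Lyapunov step mentioned in this abstract is commonly realised with a virtual queue that accumulates violations of the long-term constraint, so that each time slot only needs a per-slot trade-off between delay and the queue-weighted accuracy. The sketch below shows that textbook drift-plus-penalty pattern in plain Python. It is not the paper's algorithm: the deep RL policy is replaced by a greedy one-step choice, and the candidate actions, their delay and accuracy values, the weight `v`, and all names are hypothetical.

```python
def update_queue(q, acc_min, acc):
    """Virtual queue: grows when achieved accuracy falls below acc_min."""
    return max(q + acc_min - acc, 0.0)


def pick_action(actions, q, v):
    """Greedy drift-plus-penalty choice over (name, delay, accuracy) tuples.

    Minimises v * delay - q * accuracy: a large backlog q favours accuracy,
    a large weight v favours low delay.
    """
    return min(actions, key=lambda a: v * a[1] - q * a[2])


# Hypothetical sampling-rate options: (name, delay, accuracy).
actions = [("low_rate", 1.0, 0.6), ("high_rate", 3.0, 0.9)]
acc_min = 0.8  # assumed long-term accuracy requirement
q = 0.0
history = []
for _ in range(20):
    name, delay, acc = pick_action(actions, q, v=0.1)
    q = update_queue(q, acc_min, acc)
    history.append(name)
print(history)
```

With an empty queue the cheaper low-rate action is chosen. As accuracy shortfalls accumulate in `q`, the choice shifts to the high-rate action until the backlog drains, which is how the per-slot rule enforces the long-term average constraint.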
Traditional statistical inference is static, in the sense that the estimate of the quantity of interest does not affect the future evolution of the quantity. In some sequential estimation problems, however, the future values of the quantity to be estimated depend on the estimate of its current value. This type of estimation problem has been formulated as the dynamic inference problem. In this work, we formulate the Bayesian learning problem for dynamic inference, where the unknown quantity-generation model is assumed to be randomly drawn according to a random model parameter. We derive the optimal Bayesian learning rules, both offline and online, to minimize the inference loss. Moreover, learning for dynamic inference can serve as a meta problem, such that all familiar machine learning problems, including supervised learning, imitation learning and reinforcement learning, can be cast as its special cases or variants. Gaining a good understanding of this unifying meta problem thus sheds light on a broad spectrum of machine learning problems as well.