When using LiDAR semantic segmentation models for safety-critical applications such as autonomous driving, it is essential to understand and improve their robustness with respect to a large range of LiDAR corruptions. In this paper, we aim to comprehensively analyze the robustness of LiDAR semantic segmentation models under various corruptions. To rigorously evaluate the robustness and generalizability of current approaches, we propose a new benchmark called SemanticKITTI-C, which features 16 out-of-domain LiDAR corruptions in three groups, namely adverse weather, measurement noise and cross-device discrepancy. Then, we systematically investigate 11 LiDAR semantic segmentation models, especially spanning different input representations (e.g., point clouds, voxels, projected images, and etc.), network architectures and training schemes. Through this study, we obtain two insights: 1) We find out that the input representation plays a crucial role in robustness. Specifically, under specific corruptions, different representations perform variously. 2) Although state-of-the-art methods on LiDAR semantic segmentation achieve promising results on clean data, they are less robust when dealing with noisy data. Finally, based on the above observations, we design a robust LiDAR segmentation model (RLSeg) which greatly boosts the robustness with simple but effective modifications. It is promising that our benchmark, comprehensive analysis, and observations can boost future research in robust LiDAR semantic segmentation for safety-critical applications.
translated by 谷歌翻译
Gaussian process state-space model (GPSSM) is a fully probabilistic state-space model that has attracted much attention over the past decade. However, the outputs of the transition function in the existing GPSSMs are assumed to be independent, meaning that the GPSSMs cannot exploit the inductive biases between different outputs and lose certain model capacities. To address this issue, this paper proposes an output-dependent and more realistic GPSSM by utilizing the well-known, simple yet practical linear model of coregionalization (LMC) framework to represent the output dependency. To jointly learn the output-dependent GPSSM and infer the latent states, we propose a variational sparse GP-based learning method that only gently increases the computational complexity. Experiments on both synthetic and real datasets demonstrate the superiority of the output-dependent GPSSM in terms of learning and inference performance.
translated by 谷歌翻译
The mainstream of the existing approaches for video prediction builds up their models based on a Single-In-Single-Out (SISO) architecture, which takes the current frame as input to predict the next frame in a recursive manner. This way often leads to severe performance degradation when they try to extrapolate a longer period of future, thus limiting the practical use of the prediction model. Alternatively, a Multi-In-Multi-Out (MIMO) architecture that outputs all the future frames at one shot naturally breaks the recursive manner and therefore prevents error accumulation. However, only a few MIMO models for video prediction are proposed and they only achieve inferior performance due to the date. The real strength of the MIMO model in this area is not well noticed and is largely under-explored. Motivated by that, we conduct a comprehensive investigation in this paper to thoroughly exploit how far a simple MIMO architecture can go. Surprisingly, our empirical studies reveal that a simple MIMO model can outperform the state-of-the-art work with a large margin much more than expected, especially in dealing with longterm error accumulation. After exploring a number of ways and designs, we propose a new MIMO architecture based on extending the pure Transformer with local spatio-temporal blocks and a new multi-output decoder, namely MIMO-VP, to establish a new standard in video prediction. We evaluate our model in four highly competitive benchmarks (Moving MNIST, Human3.6M, Weather, KITTI). Extensive experiments show that our model wins 1st place on all the benchmarks with remarkable performance gains and surpasses the best SISO model in all aspects including efficiency, quantity, and quality. We believe our model can serve as a new baseline to facilitate the future research of video prediction tasks. The code will be released.
translated by 谷歌翻译
Accurate polyp segmentation is of great importance for colorectal cancer diagnosis and treatment. However, due to the high cost of producing accurate mask annotations, existing polyp segmentation methods suffer from severe data shortage and impaired model generalization. Reversely, coarse polyp bounding box annotations are more accessible. Thus, in this paper, we propose a boosted BoxPolyp model to make full use of both accurate mask and extra coarse box annotations. In practice, box annotations are applied to alleviate the over-fitting issue of previous polyp segmentation models, which generate fine-grained polyp area through the iterative boosted segmentation model. To achieve this goal, a fusion filter sampling (FFS) module is firstly proposed to generate pixel-wise pseudo labels from box annotations with less noise, leading to significant performance improvements. Besides, considering the appearance consistency of the same polyp, an image consistency (IC) loss is designed. Such IC loss explicitly narrows the distance between features extracted by two different networks, which improves the robustness of the model. Note that our BoxPolyp is a plug-and-play model, which can be merged into any appealing backbone. Quantitative and qualitative experimental results on five challenging benchmarks confirm that our proposed model outperforms previous state-of-the-art methods by a large margin.
translated by 谷歌翻译
Measuring and alleviating the discrepancies between the synthetic (source) and real scene (target) data is the core issue for domain adaptive semantic segmentation. Though recent works have introduced depth information in the source domain to reinforce the geometric and semantic knowledge transfer, they cannot extract the intrinsic 3D information of objects, including positions and shapes, merely based on 2D estimated depth. In this work, we propose a novel Geometry-Aware Network for Domain Adaptation (GANDA), leveraging more compact 3D geometric point cloud representations to shrink the domain gaps. In particular, we first utilize the auxiliary depth supervision from the source domain to obtain the depth prediction in the target domain to accomplish structure-texture disentanglement. Beyond depth estimation, we explicitly exploit 3D topology on the point clouds generated from RGB-D images for further coordinate-color disentanglement and pseudo-labels refinement in the target domain. Moreover, to improve the 2D classifier in the target domain, we perform domain-invariant geometric adaptation from source to target and unify the 2D semantic and 3D geometric segmentation results in two domains. Note that our GANDA is plug-and-play in any existing UDA framework. Qualitative and quantitative results demonstrate that our model outperforms state-of-the-arts on GTA5->Cityscapes and SYNTHIA->Cityscapes.
translated by 谷歌翻译
Dialogue state tracking (DST) aims to convert the dialogue history into dialogue states which consist of slot-value pairs. As condensed structural information memorizing all history information, the dialogue state in the last turn is typically adopted as the input for predicting the current state by DST models. However, these models tend to keep the predicted slot values unchanged, which is defined as state momentum in this paper. Specifically, the models struggle to update slot values that need to be changed and correct wrongly predicted slot values in the last turn. To this end, we propose MoNET to tackle state momentum via noise-enhanced training. First, the previous state of each turn in the training data is noised via replacing some of its slot values. Then, the noised previous state is used as the input to learn to predict the current state, improving the model's ability to update and correct slot values. Furthermore, a contrastive context matching framework is designed to narrow the representation distance between a state and its corresponding noised variant, which reduces the impact of noised state and makes the model better understand the dialogue history. Experimental results on MultiWOZ datasets show that MoNET outperforms previous DST methods. Ablations and analysis verify the effectiveness of MoNET in alleviating state momentum and improving anti-noise ability.
translated by 谷歌翻译
本文考虑通过模型量化提高联邦学习(FL)的无线通信和计算效率。在提出的Bitwidth FL方案中,Edge设备将其本地FL模型参数的量化版本训练并传输到协调服务器,从而将它们汇总为量化的全局模型并同步设备。目的是共同确定用于本地FL模型量化的位宽度以及每次迭代中参与FL训练的设备集。该问题被视为一个优化问题,其目标是在每卷工具采样预算和延迟要求下最大程度地减少量化FL的训练损失。为了得出解决方案,进行分析表征,以显示有限的无线资源和诱导的量化误差如何影响所提出的FL方法的性能。分析结果表明,两个连续迭代之间的FL训练损失的改善取决于设备的选择和量化方案以及所学模型固有的几个参数。给定基于线性回归的这些模型属性的估计值,可以证明FL训练过程可以描述为马尔可夫决策过程(MDP),然后提出了基于模型的增强学习(RL)方法来优化动作的方法选择迭代。与无模型RL相比,这种基于模型的RL方法利用FL训练过程的派生数学表征来发现有效的设备选择和量化方案,而无需强加其他设备通信开销。仿真结果表明,与模型无RL方法和标准FL方法相比,提出的FL算法可以减少29%和63%的收敛时间。
translated by 谷歌翻译
由于其稀疏和细长的性质,估算3D空间中准确的车道线仍然具有挑战性。在这项工作中,我们提出了M^2-3dlanenet,这是一个有效3D车道检测的多模式框架。旨在集成来自多传感器的互补信息,M^2-3dlanenet首先将多模式特征提取具有模态特异性骨架,然后将它们融合在统一的鸟眼视图(BEV)空间中。具体而言,我们的方法由两个核心组成部分组成。 1)要获得准确的2D-3D映射,我们提出了自上而下的BEV生成。其中,使用线条限制的变形(LRDA)模块可用于以自上而下的方式有效地增强图像特征,从而充分捕获车道的细长特征。之后,它使用深度感知的举重将2D锥体特征投入到3D空间中,并通过枕形生成BEV特征。 2)我们进一步提出了自下而上的BEV融合,该融合通过多尺度的级联注意力汇总了多模式特征,从而集成了来自摄像头和激光雷达传感器的互补信息。足够的实验证明了M^2-3dlanenet的有效性,该实验的有效性超过了先前的最先进方法,即在OpenLane数据集上提高了12.1%的F1-SCORE改善。
translated by 谷歌翻译
在本文中,提出了用于文本数据传输的语义通信框架。在研究的模型中,基站(BS)从文本数据中提取语义信息,并将其传输到每个用户。语义信息由由一组语义三元组组成的知识图(kg)建模。收到语义信息后,每个用户都使用图形到文本生成模型恢复原始文本。为了衡量所考虑的语义通信框架的性能,提出了共同捕获恢复文本的语义准确性和完整性的语义相似性(MSS)的指标。由于无线资源限制,BS可能无法将整个语义信息传输给每个用户并满足传输延迟约束。因此,BS必须为每个用户选择适当的资源块,并确定和将一部分语义信息传输给用户。因此,我们制定了一个优化问题,其目标是通过共同优化资源分配策略并确定要传输的部分语义信息来最大化总MSS。为了解决这个问题,提出了与注意力网络集成的基于近端优化的强化增强学习(RL)算法。所提出的算法可以使用注意网络在语义信息中评估每个三重组的重要性,然后在语义信息中三元组的重要性分布与总MSS之间建立关系。与传统的RL算法相比,所提出的算法可以动态调整其学习率,从而确保收敛到本地最佳解决方案。
translated by 谷歌翻译
个性化联合学习(FL)促进了多个客户之间的合作,以学习个性化模型而无需共享私人数据。该机制减轻了系统中通常遇到的统计异质性,即不同客户端的非IID数据。现有的个性化算法通常假设所有客户自愿进行个性化。但是,潜在的参与者可能仍然不愿个性化模型,因为他们可能无法正常工作。在这种情况下,客户选择使用全局模型。为了避免做出不切实际的假设,我们介绍了个性化率,该率是愿意培训个性化模型,将其介绍给联合设置并提出DYPFL的客户的比例。这种动态个性化的FL技术激励客户参与个性化本地模型,同时允许在整体模型表现更好时采用全球模型。我们表明,DYPFL中的算法管道可以保证良好的收敛性能,从而使其在广泛的条件下优于替代性个性化方法,包括异质性,客户端数量,本地时期和批量尺寸的变化。
translated by 谷歌翻译