Vision Transformers (ViTs) have recently demonstrated state-of-the-art performance on various vision tasks, replacing convolutional neural networks (CNNs). Since ViTs have a structure different from that of CNNs, they may also behave differently. To explore the reliability of ViTs, this paper studies the behavior and robustness of ViTs. We compare the robustness of CNNs and ViTs under various image corruptions that may occur in practical vision tasks. We confirm that, for most image transformations, ViTs show robustness comparable to or higher than that of CNNs. For contrast enhancement, however, severe performance degradation is consistently observed in ViTs. From a detailed analysis, we identify a potential cause: the positional embedding in ViT's patch embedding may work improperly when the color scale changes. Here, we propose PreLayerNorm, a modified patch embedding structure, to ensure scale-invariant behavior of ViTs. ViT with PreLayerNorm shows improved robustness across various corruptions, including contrast-varying environments.
translated by Google Translate
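The scale-invariance argument above can be illustrated with a small sketch. This is a minimal, hypothetical toy (the function names and toy weights are our own, not from the paper): if the patch projection is linear with no bias, a contrast change scales the embedding uniformly, and a LayerNorm applied before the positional embedding is added cancels that scale.

```python
import math

def layer_norm(x, eps=1e-6):
    """Normalize a feature vector to zero mean and unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def patch_embed_pre_ln(patch, weight):
    """Linear patch projection (no bias) followed by LayerNorm,
    applied BEFORE any positional embedding would be added."""
    proj = [sum(w * p for w, p in zip(row, patch)) for row in weight]
    return layer_norm(proj)

patch = [0.2, 0.8, 0.5, 0.1]          # flattened pixel patch
dark = [0.5 * v for v in patch]        # same patch at lower contrast
W = [[1.0, -0.5, 0.3, 0.2], [0.4, 0.1, -0.7, 0.6], [0.9, 0.2, 0.1, -0.3]]

a = patch_embed_pre_ln(patch, W)
b = patch_embed_pre_ln(dark, W)
print(all(abs(x - y) < 1e-4 for x, y in zip(a, b)))  # True: scale-invariant
```

Because the normalization runs before the positional embedding is added, the positional signal keeps a fixed relative magnitude regardless of image contrast.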
The number of international benchmarking competitions is steadily increasing in various fields of machine learning (ML) research and practice. So far, however, little is known about the common practice as well as bottlenecks faced by the community in tackling the research questions posed. To shed light on the status quo of algorithm development in the specific field of biomedical image analysis, we designed an international survey that was issued to all participants of challenges conducted in conjunction with the IEEE ISBI 2021 and MICCAI 2021 conferences (80 competitions in total). The survey covered participants' expertise and working environments, their chosen strategies, as well as algorithm characteristics. A median of 72% of challenge participants took part in the survey. According to our results, knowledge exchange was the primary incentive (70%) for participation, while the reception of prize money played only a minor role (16%). While a median of 80 working hours was spent on method development, a large portion of participants stated that they did not have enough time for method development (32%). 25% perceived the infrastructure to be a bottleneck. Overall, 94% of all solutions were deep learning-based. Of these, 84% were based on standard architectures. 43% of the respondents reported that the data samples (e.g., images) were too large to be processed at once. This was most commonly addressed by patch-based training (69%), downsampling (37%), and solving 3D analysis tasks as a series of 2D tasks. K-fold cross-validation on the training set was performed by only 37% of the participants, and only 50% of the participants performed ensembling, based on either multiple identical models (61%) or heterogeneous models (39%). 48% of the respondents applied postprocessing steps.
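Patch-based training, the most common workaround reported above for oversized samples, amounts to tiling a large image into fixed-size windows. A minimal illustrative sketch (our own toy example, not code from the survey):

```python
def extract_patches(image, patch_size, stride):
    """Slide a square window over a 2D image (list of lists) and
    return the list of patches, so that a model can be trained on
    windows of an image too large to process at once."""
    h, w = len(image), len(image[0])
    patches = []
    for top in range(0, h - patch_size + 1, stride):
        for left in range(0, w - patch_size + 1, stride):
            patches.append([row[left:left + patch_size]
                            for row in image[top:top + patch_size]])
    return patches

image = [[r * 8 + c for c in range(8)] for r in range(8)]  # toy 8x8 "image"
patches = extract_patches(image, patch_size=4, stride=4)
print(len(patches))  # 4 non-overlapping 4x4 patches
```

With stride smaller than patch_size the windows overlap, a common choice when predictions are later stitched back together.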
Most scanning LiDAR sensors generate a sequence of point clouds in real-time. While conventional 3D object detectors use a set of unordered LiDAR points acquired over a fixed time interval, recent studies have revealed that substantial performance improvement can be achieved by exploiting the spatio-temporal context present in a sequence of LiDAR point sets. In this paper, we propose a novel 3D object detection architecture, which can encode LiDAR point cloud sequences acquired by multiple successive scans. The encoding process of the point cloud sequence is performed on two different time scales. We first design a short-term motion-aware voxel encoding that captures the short-term temporal changes of point clouds driven by the motion of objects in each voxel. We also propose long-term motion-guided bird's eye view (BEV) feature enhancement that adaptively aligns and aggregates the BEV feature maps obtained by the short-term voxel encoding by utilizing the dynamic motion context inferred from the sequence of the feature maps. The experiments conducted on the public nuScenes benchmark demonstrate that the proposed 3D object detector offers significant improvements in performance compared to the baseline methods and that it sets a state-of-the-art performance for certain 3D object detection categories. Code is available at https://github.com/HYjhkoh/MGTANet.git
Predicting the future motion of dynamic agents is of paramount importance to ensure safety or assess risks in motion planning for autonomous robots. In this paper, we propose a two-stage motion prediction method, referred to as R-Pred, that effectively utilizes both the scene and interaction context using a cascade of the initial trajectory proposal network and the trajectory refinement network. The initial trajectory proposal network produces M trajectory proposals corresponding to M modes of a future trajectory distribution. The trajectory refinement network enhances each of M proposals using 1) the tube-query scene attention (TQSA) and 2) the proposal-level interaction attention (PIA). TQSA uses tube-queries to aggregate the local scene context features pooled from proximity around the trajectory proposals of interest. PIA further enhances the trajectory proposals by modeling inter-agent interactions using a group of trajectory proposals selected based on their distances from neighboring agents. Our experiments conducted on the Argoverse and nuScenes datasets demonstrate that the proposed refinement network provides significant performance improvements compared to the single-stage baseline and that R-Pred achieves state-of-the-art performance in some categories of the benchmark.
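The proposal-level interaction attention described above operates on a group of proposals selected by their distances from neighboring agents. As a hedged sketch of that selection step only (function name, data layout, and the endpoint-distance criterion are our simplifying assumptions, not the paper's exact procedure):

```python
def select_interacting(proposals, ego_idx, k):
    """Pick the k trajectory proposals whose endpoints lie closest
    to a given agent's proposal endpoint -- a simplified stand-in
    for distance-based proposal selection before interaction
    attention."""
    ex, ey = proposals[ego_idx][-1]
    others = [i for i in range(len(proposals)) if i != ego_idx]
    others.sort(key=lambda i: (proposals[i][-1][0] - ex) ** 2
                              + (proposals[i][-1][1] - ey) ** 2)
    return others[:k]

# each proposal is a short (x, y) trajectory; endpoints decide proximity
props = [[(0, 0), (1, 0)], [(0, 1), (2, 0)], [(9, 9), (10, 10)]]
print(select_interacting(props, ego_idx=0, k=1))  # [1]: the nearby agent
```

Restricting attention to such a selected group keeps the interaction modeling focused on agents that can plausibly influence the trajectory in question.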
Compared with LiDAR, camera and radar sensors have significant advantages in cost, reliability, and maintenance. Existing fusion methods typically fuse the outputs of the individual modalities at the result level, a so-called late fusion strategy. Late fusion can benefit from off-the-shelf single-sensor detection algorithms, but it cannot fully exploit the complementary properties of the sensors; thus, despite the great potential of camera-radar fusion, its performance has been limited. Here, we propose a novel proposal-level early fusion approach that effectively exploits both the spatial and contextual properties of camera and radar for 3D object detection. Our fusion framework first associates image proposals with radar points in the polar coordinate system to efficiently handle the discrepancy between the coordinate systems and spatial properties. Using this as a first stage, consecutive cross-attention-based feature fusion layers adaptively exchange spatial and contextual information between camera and radar, leading to robust and attentive fusion. Our camera-radar fusion approach achieves a state-of-the-art 41.1% mAP and 52.3% NDS on the nuScenes test set, which is 8.7 and 10.8 points higher than the camera-only baseline, respectively, and performs competitively with LiDAR-based methods.
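The first stage above, associating image proposals with radar points in polar coordinates, can be sketched minimally. This is an illustrative simplification under our own assumptions (an azimuth-interval match only; the paper's actual association is richer):

```python
import math

def to_polar(x, y):
    """Convert a radar return from Cartesian to polar coordinates."""
    return math.hypot(x, y), math.atan2(y, x)

def associate(proposal_azimuth_range, radar_points):
    """Keep radar points whose azimuth falls inside an image
    proposal's azimuth range -- a simplified proposal-to-radar
    association in the polar coordinate system."""
    lo, hi = proposal_azimuth_range
    matched = []
    for x, y in radar_points:
        r, az = to_polar(x, y)
        if lo <= az <= hi:
            matched.append((r, az))
    return matched

radar = [(10.0, 1.0), (5.0, 5.0), (2.0, -8.0)]
# hypothetical proposal covering azimuths from 0 to 0.5 rad
print(len(associate((0.0, 0.5), radar)))  # 1 point falls in the range
```

Working in azimuth sidesteps the mismatch between the image's perspective geometry and the radar's range-bearing geometry: a camera proposal constrains bearing well but range poorly, which is exactly what the matched radar returns supply.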
We propose a novel algorithm for monocular depth estimation that decomposes a metric depth map into a normalized depth map and scale features. The proposed network consists of a shared encoder and three decoders, called G-Net, N-Net, and M-Net, which estimate gradient maps, a normalized depth map, and a metric depth map, respectively. M-Net learns to estimate metric depth more accurately using the relative depth features extracted by G-Net and N-Net. The proposed algorithm has the advantage that it can use datasets without metric depth labels to improve the performance of metric depth estimation. Experimental results on various datasets demonstrate that the proposed algorithm not only provides competitive performance with state-of-the-art algorithms but also yields acceptable results even when only a small amount of metric depth data is available for training.
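The core decomposition idea, splitting metric depth into a scale and a scale-free map, can be shown with a toy mean-based normalization. This is our own illustrative choice of normalizer, not necessarily the paper's formulation:

```python
def decompose(depth):
    """Split a metric depth map into a single scale (its mean) and
    a normalized, scale-free depth map, so that relative structure
    and absolute scale can be learned separately."""
    flat = [d for row in depth for d in row]
    scale = sum(flat) / len(flat)
    normalized = [[d / scale for d in row] for row in depth]
    return scale, normalized

def recompose(scale, normalized):
    """Recover the metric depth map from scale and normalized map."""
    return [[scale * d for d in row] for row in normalized]

depth = [[2.0, 4.0], [6.0, 8.0]]
scale, normalized = decompose(depth)
assert recompose(scale, normalized) == depth  # lossless round trip
print(scale)  # 5.0
```

The practical payoff stated in the abstract follows from this split: the normalized map can be supervised with relative-depth data that carries no metric labels, while only the scale branch needs metric ground truth.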
Sign language production (SLP) aims to translate expressions in a spoken language into corresponding expressions in a sign language, such as skeleton-based sign poses or videos. Existing SLP models are either autoregressive (AR) or non-autoregressive (NAR). However, AR-SLP models suffer from regression to the mean and error propagation during decoding. NSLP-G, a NAR-based model, resolves these issues to some extent but introduces other problems: for example, it does not consider the target sign length and suffers from false decoding initiation. We propose a novel NAR-SLP model with knowledge distillation (KD) to address these problems. First, we devise a length regulator to predict the end of the generated sign pose sequence. We then adopt KD, which distills spatial-linguistic features from a pre-trained pose encoder, to alleviate false decoding initiation. Extensive experiments show that the proposed approach significantly outperforms existing SLP models in both the Fréchet Gesture Distance and back-translation evaluation.
Automatic methods for predicting the listener mean opinion score (MOS) have been studied to ensure the quality of text-to-speech systems. Many previous studies have focused on architectural advances (e.g., MBNet, LDNet, etc.) to capture the relationship between spectral features and MOS more effectively, and have achieved high accuracy. However, the optimal representation in terms of generalization ability remains largely unknown. To this end, we compare the performance of self-supervised learning (SSL) features obtained with the wav2vec framework against that of spectral features such as the magnitude spectrogram and mel-spectrogram. Moreover, we propose combining the SSL features with features that we believe retain essential information for automatic MOS prediction, so that the two compensate for each other's drawbacks. We conduct comprehensive experiments on a large-scale listening test corpus collected from past Blizzard and Voice Conversion Challenges. We find that the wav2vec feature set shows the best generalization, even though the given ground truth is not always reliable. Furthermore, we find that the combinations perform best and analyze how they bridge the gap between the spectral and wav2vec feature sets.
Although robot-based automation in chemistry laboratories can accelerate the materials development process, unmonitored environments may lead to dangerous accidents, primarily due to machine control errors. Object detection techniques can play a vital role in addressing these safety issues; however, state-of-the-art detectors, including single-shot detector (SSD) models, suffer from insufficient accuracy in environments involving complex and noisy scenes. To improve the safety of unmonitored laboratories, we report a novel deep learning (DL)-based object detector, namely DenseSSD. For the foremost and frequent problem of detecting vial positions, DenseSSD achieved a mean average precision (mAP) of over 95% on a complex dataset involving both empty and solution-filled vials, greatly exceeding that of conventional detectors. Such high precision is vital to minimizing failure-induced accidents. Additionally, DenseSSD was observed to be highly insensitive to environmental changes, maintaining its high precision under variations in solution colors and testing view angles. This robustness allows the equipment setup to be more flexible. This work demonstrates that DenseSSD is useful for enhancing safety in automated material synthesis environments, and it can be extended to various applications requiring high detection accuracy and speed.
In this paper, we propose a new joint object detection and tracking (JoDT) framework for 3D object detection and tracking based on camera and LiDAR sensors. The proposed method, referred to as 3D DetecTrack, enables the detector and tracker to cooperate in generating a spatio-temporal representation of the camera and LiDAR data, with which 3D object detection and tracking are then performed. The detector constructs the spatio-temporal features via weighted temporal aggregation of the spatial features obtained by camera and LiDAR fusion. The detector then reconfigures the initial detection results using information from the tracklets maintained up to the previous time step. Based on the spatio-temporal features generated by the detector, the tracker associates the detected objects with previously tracked objects using a graph neural network (GNN). We devise a fully connected GNN facilitated by a combination of rule-based edge pruning and attention-based edge gating, which exploits both spatial and temporal object contexts to improve tracking performance. Experiments conducted on the KITTI and nuScenes benchmarks demonstrate that the proposed 3D DetecTrack achieves significant improvements in both detection and tracking performance over the baseline methods, and achieves state-of-the-art performance among existing methods through the collaboration of the detector and tracker.
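The rule-based edge pruning mentioned above can be sketched in isolation. This is a minimal toy under our own assumptions (2D positions and a single distance threshold; the paper combines this with attention-based edge gating, which is omitted here):

```python
import math

def build_edges(positions, max_dist):
    """Start from a fully connected graph over detections and prune
    directed edges between objects farther apart than max_dist --
    a rule-based pruning step applied before any learned gating."""
    n = len(positions)
    edges = []
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            (x1, y1), (x2, y2) = positions[i], positions[j]
            if math.hypot(x2 - x1, y2 - y1) <= max_dist:
                edges.append((i, j))
    return edges

dets = [(0.0, 0.0), (1.0, 0.0), (50.0, 50.0)]
print(len(build_edges(dets, max_dist=5.0)))  # 2: only the nearby pair keeps its edges
```

Pruning implausibly long edges before message passing keeps the association graph sparse, so the learned attention only has to weigh physically plausible detection-to-tracklet matches.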