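It is well recognized that fusing the complementary information from depth-aware LiDAR point clouds and semantically rich stereo images benefits 3D object detection. Nevertheless, exploring the inherently unnatural interaction between sparse 3D points and dense 2D pixels is not trivial. To ease this difficulty, recent proposals generally project the 3D points onto the 2D image plane to sample the image data and then aggregate the data at the points. However, this approach often suffers from the mismatch between the resolutions of point clouds and RGB images, leading to sub-optimal performance. Specifically, using the sparse points as the multi-modal data aggregation locations causes severe information loss for high-resolution images, which in turn undermines the effectiveness of multi-sensor fusion. In this paper, we present VPFNet, a new architecture that cleverly aligns and aggregates the point cloud and image data at "virtual" points. In particular, because their density lies between that of the 3D points and that of the 2D pixels, the virtual points can nicely bridge the resolution gap between the two sensors and thus preserve more information for processing. Moreover, we also investigate data augmentation techniques that can be applied to both point clouds and RGB images, as data augmentation has made a non-negligible contribution to 3D object detectors to date. We have conducted extensive experiments on the KITTI dataset and have observed good performance compared to state-of-the-art methods. Remarkably, our VPFNet achieves 83.21% moderate 3D AP and 91.86% moderate BEV AP on the KITTI test set, ranking first since May 21, 2021. The network design also takes computational efficiency into consideration: we can achieve 15 FPS on a single NVIDIA RTX 2080 Ti GPU. The code will be made available for reproduction and further investigation.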
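The projection-based fusion baseline mentioned in this abstract (project each LiDAR point onto the image plane, then sample the image at that pixel) can be sketched as follows. This is only an illustration of that baseline under stated assumptions, not VPFNet's virtual-point method: the pinhole intrinsics `fx`, `fy`, `cx`, `cy`, the function names, and the toy image are all hypothetical, and the points are assumed to be expressed in the camera frame already.

```python
from typing import List, Optional, Tuple

Point3D = Tuple[float, float, float]


def project_point(p: Point3D, fx: float, fy: float,
                  cx: float, cy: float) -> Optional[Tuple[int, int]]:
    """Project a camera-frame 3D point to integer pixel coordinates (u, v).

    Returns None for points at or behind the camera plane (z <= 0).
    """
    x, y, z = p
    if z <= 0:
        return None
    u = fx * x / z + cx
    v = fy * y / z + cy
    return int(round(u)), int(round(v))


def sample_image_at_points(points: List[Point3D], image: List[List[int]],
                           fx: float, fy: float, cx: float, cy: float):
    """Attach the nearest pixel value to every point that lands inside the image."""
    height, width = len(image), len(image[0])
    fused = []
    for p in points:
        uv = project_point(p, fx, fy, cx, cy)
        if uv is None:
            continue
        u, v = uv
        if 0 <= u < width and 0 <= v < height:
            fused.append((p, image[v][u]))
    return fused


# Toy 4x4 "image" whose pixel value encodes its own (row, column) position.
image = [[10 * r + c for c in range(4)] for r in range(4)]
# Hypothetical points: two visible, one behind the camera, one outside the view.
points = [(0.0, 0.0, 2.0), (1.0, 1.0, 2.0), (0.0, 0.0, -1.0), (50.0, 0.0, 1.0)]
fused = sample_image_at_points(points, image, fx=2.0, fy=2.0, cx=1.0, cy=1.0)
print(len(fused), "of", len(image) * len(image[0]), "pixels were sampled")
```

In this toy setup only the pixels hit by a projected point are ever read, which is the resolution mismatch the abstract attributes to aggregating at sparse points.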
translated by Google Translate
Corals are the primary habitat-building life-form on reefs that support a quarter of the species in the ocean. A coral reef ecosystem usually consists of reefs, each of which is like a tall building in a city. Reef-building corals secrete hard calcareous exoskeletons that give them structural rigidity, which is also a prerequisite for accurate 3D modeling and semantic mapping using advanced photogrammetric computer vision and machine learning. Underwater videography, a modern underwater remote sensing tool, provides a high-resolution technique for surveying and mapping coral habitat. In this paper, detailed 3D mesh models, digital surface models and orthophotos of the coral habitat are generated from the collected coral images and underwater control points. Meanwhile, a novel pixel-wise semantic segmentation of the orthophotos is performed using advanced deep learning. Finally, the semantic map is mapped into 3D space. For the first time, 3D fine-grained semantic modeling and rugosity evaluation of coral reefs have been completed at millimeter (mm) accuracy. This provides a new and powerful method for understanding the processes and characteristics of coral reef change at high spatial and temporal resolution under climate change.
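Token uniformity is commonly observed in transformer-based models: different tokens share a large proportion of similar information after passing through multiple stacked self-attention layers. In this paper, we propose to use the distribution of singular values of the outputs of each transformer layer to characterize the phenomenon of token uniformity, and we empirically illustrate that a less skewed singular value distribution can alleviate the "token uniformity" problem. Based on our observations, we define several desirable properties of singular value distributions and propose a novel transformation function for updating the singular values. We show that, apart from alleviating token uniformity, the transformation function should preserve the local neighborhood structure in the original embedding space. Our proposed singular value transformation function is applied to a range of transformer-based language models such as BERT, ALBERT, RoBERTa and DistilBERT, and improved performance is observed in semantic textual similarity evaluation and a range of GLUE tasks. Our source code is available at https://github.com/hanqi-qi/tokenuni.git.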
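To make the link between singular values and token uniformity concrete, here is a small sketch that summarizes a layer output by the fraction of squared singular-value mass carried by the largest singular value. This single number is a simplified proxy chosen for illustration, not the measure or transformation function used by the authors; the function names and toy matrices are assumptions. A value near 1 means one direction dominates, so the token vectors are nearly identical.

```python
import math
from typing import List

Matrix = List[List[float]]


def gram(x: Matrix) -> Matrix:
    """Return X^T X for a (tokens x dims) matrix X."""
    dims = len(x[0])
    return [[sum(row[i] * row[j] for row in x) for j in range(dims)]
            for i in range(dims)]


def top_singular_energy(x: Matrix, iters: int = 200) -> float:
    """Fraction sigma_1^2 / sum_i sigma_i^2 of a token matrix.

    The largest eigenvalue of X^T X equals sigma_1^2 and its trace equals the
    sum of all sigma_i^2. The largest eigenvalue is estimated by power
    iteration from a deliberately non-uniform start vector (an assumption:
    the start vector must not be orthogonal to the leading eigenvector).
    """
    g = gram(x)
    dims = len(g)
    trace = sum(g[i][i] for i in range(dims))
    if trace == 0.0:
        return 0.0
    v = [1.0 + 0.1 * i for i in range(dims)]
    norm = math.sqrt(sum(c * c for c in v))
    v = [c / norm for c in v]
    for _ in range(iters):
        w = [sum(g[i][j] * v[j] for j in range(dims)) for i in range(dims)]
        norm = math.sqrt(sum(c * c for c in w))
        if norm == 0.0:
            break
        v = [c / norm for c in w]
    # Rayleigh quotient v^T G v of the unit vector v approximates sigma_1^2.
    top = sum(v[i] * sum(g[i][j] * v[j] for j in range(dims))
              for i in range(dims))
    return top / trace


# Toy layer outputs: three identical tokens versus two orthogonal tokens.
uniform_tokens = [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]]
diverse_tokens = [[1.0, 0.0], [0.0, 1.0]]
print(top_singular_energy(uniform_tokens), top_singular_energy(diverse_tokens))
```

Identical tokens put all of the mass on one singular value, while orthogonal tokens of equal norm split it evenly.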
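We propose VDL-Surrogate, a view-dependent, neural-network-latent-based surrogate model for parameter space exploration of ensemble simulations that allows high-resolution visualizations and user-specified visual mappings. Surrogate-enabled parameter space exploration allows domain scientists to preview simulation results without having to run a large number of computationally costly simulations. Limited by computational resources, however, existing surrogate models may not produce previews with sufficient resolution for visualization and analysis. To improve the efficient use of computational resources and support high-resolution exploration, we perform ray casting from different viewpoints to collect samples and produce compact latent representations. This latent encoding process reduces the cost of surrogate model training while maintaining the output quality. In the model training stage, we select viewpoints to cover the whole viewing sphere and train corresponding VDL-Surrogate models for the selected viewpoints. In the model inference stage, we predict the latent representations at the previously selected viewpoints and decode the latent representations to data space. For any given viewpoint, we interpolate over the decoded data at the selected viewpoints and generate the visualization with user-specified visual mappings. We show the effectiveness and efficiency of VDL-Surrogate in cosmological and ocean simulations with quantitative and qualitative evaluations. Source code is publicly available at https://github.com/trainsn/vdl-surrogate.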
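Recent years have witnessed increasing interest in developing interpretable models in natural language processing (NLP). Most existing models aim to identify input features, such as words or phrases, that are important for model predictions. However, neural models developed in NLP often compose word semantics in a hierarchical manner, and text classification requires hierarchical modeling to aggregate local information in order to deal with topic and label shifts more effectively. As such, interpretation by words or phrases alone cannot faithfully explain model decisions in text classification. This paper proposes a novel hierarchical interpretable neural text classifier, called Hint, which can automatically generate explanations of model predictions in the form of label-associated topics in a hierarchical manner. Model interpretation is no longer at the word level but is built on topics as the basic semantic unit. Experimental results on both review datasets and news datasets show that our proposed approach achieves text classification results on par with existing state-of-the-art text classifiers, and generates interpretations that are more faithful to model predictions and better understood by humans than those of other interpretable neural text classifiers.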
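According to the National Academies, a weekly forecast of the velocity, vertical structure, and duration of the Loop Current (LC) and its eddies is critical for understanding the oceanography and ecosystem of the Gulf of Mexico (GoM), and for mitigating the outcomes of anthropogenic and natural disasters there. However, this forecast is a challenging problem, since LC behavior is dominated by long-range spatial connections across multiple timescales. In this paper, we extend spatiotemporal predictive learning, showing its effectiveness beyond video prediction, to a 4D model: a novel physics-informed tensor-train ConvLSTM (PITT-ConvLSTM) for forecasting temporal sequences of 3D geospatial data. Specifically, we propose 1) a novel 4D higher-order recurrent neural network with empirical orthogonal function analysis to capture the hidden uncorrelated patterns of each hierarchy, 2) a convolutional tensor-train decomposition to capture higher-order space-time correlations, and 3) the incorporation of prior physics knowledge provided by domain experts to inform the learning in latent space. The advantage of our proposed method is clear: constrained by physical laws, it simultaneously learns good representations of frame dependencies (both short-term and long-term high-level dependencies) and of inter-hierarchical relations within each time frame. Experiments on geospatial data collected from the GoM demonstrate that PITT-ConvLSTM outperforms state-of-the-art methods in forecasting the volumetric velocity of the LC and its eddies for a period of more than one week.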
Benefiting from its ability to exploit intrinsic supervision information, contrastive learning has recently achieved promising performance in deep graph clustering. However, we observe that two drawbacks of the positive and negative sample construction mechanisms prevent existing algorithms from improving further. 1) The quality of positive samples depends heavily on carefully designed data augmentations, and inappropriate data augmentations easily lead to semantic drift and indiscriminative positive samples. 2) The constructed negative samples are unreliable because they ignore important clustering information. To solve these problems, we propose a Cluster-guided Contrastive deep Graph Clustering network (CCGC) that mines the intrinsic supervision information in the high-confidence clustering results. Specifically, instead of conducting complex node or edge perturbation, we construct two views of the graph by designing special Siamese encoders whose weights are not shared between the sibling sub-networks. Then, guided by the high-confidence clustering information, we carefully select and construct the positive samples from the same high-confidence cluster in the two views. Moreover, to construct semantically meaningful negative sample pairs, we regard the centers of different high-confidence clusters as negative samples, thus improving the discriminative capability and reliability of the constructed sample pairs. Lastly, we design an objective function that pulls together samples from the same cluster and pushes away those from other clusters by maximizing the cross-view cosine similarity between positive samples and minimizing it between negative samples. Extensive experimental results on six datasets demonstrate the effectiveness of CCGC compared with existing state-of-the-art algorithms.
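The objective described in this abstract (raise cross-view cosine similarity for positives, lower it for negatives built from cluster centers) can be illustrated with a minimal sketch. The exact loss used by CCGC is not given here, so the form below, mean of (1 - similarity) over positive pairs plus mean similarity over centers of different clusters, is an assumption, as are all function names and the toy embeddings.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; defined as 0.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0.0 or nb == 0.0:
        return 0.0
    return dot / (na * nb)


def cluster_center(vectors: List[Vector]) -> List[float]:
    """Mean embedding of the samples assigned to one cluster."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def contrastive_loss(positive_pairs: List[Tuple[Vector, Vector]],
                     centers_view1: List[Vector],
                     centers_view2: List[Vector]) -> float:
    """Hypothetical loss: pull positives together, push different centers apart."""
    pos_term = sum(1.0 - cosine(a, b) for a, b in positive_pairs) / len(positive_pairs)
    neg_sims = [cosine(c1, c2)
                for i, c1 in enumerate(centers_view1)
                for j, c2 in enumerate(centers_view2)
                if i != j]  # only centers of different clusters are negatives
    neg_term = sum(neg_sims) / len(neg_sims) if neg_sims else 0.0
    return pos_term + neg_term


# Well-separated clusters: positives agree across views, centers are orthogonal.
good_pairs = [([1.0, 0.0], [1.0, 0.0]), ([0.0, 1.0], [0.0, 1.0])]
good_centers = [cluster_center([[1.0, 0.0]]), cluster_center([[0.0, 1.0]])]
loss_separated = contrastive_loss(good_pairs, good_centers, good_centers)

# Collapsed embedding: every sample and every center points the same way.
bad_pairs = [([1.0, 1.0], [1.0, 1.0])]
bad_centers = [[1.0, 1.0], [1.0, 1.0]]
loss_collapsed = contrastive_loss(bad_pairs, bad_centers, bad_centers)
print(loss_separated, loss_collapsed)
```

Under this assumed form the separated configuration scores close to 0 and the collapsed one close to 1, so minimizing the loss favors distinct cluster centers.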
To generate high-quality rendered images for real-time applications, it is common to trace only a few samples per pixel (spp) at a lower resolution and then supersample to the high resolution. Based on the observation that pixels rendered at a low resolution are typically highly aliased, we present a novel method for neural supersampling based on ray tracing 1/4-spp samples at the high resolution. Our key insight is that the ray-traced samples at the target resolution are accurate and reliable, which turns supersampling into an interpolation problem. We present a mask-reinforced neural network to reconstruct and interpolate high-quality image sequences. First, a novel temporal accumulation network is introduced to compute the correlation between current and previous features, which significantly improves their temporal stability. Then a reconstruction network based on a multi-scale U-Net with skip connections is adopted to reconstruct and generate the desired high-resolution image. Experimental results and comparisons show that our proposed method generates higher-quality supersampling results than current state-of-the-art methods, without increasing the total number of ray-tracing samples.
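As a rough illustration of what tracing 1/4 spp at the target resolution means, the sketch below builds a binary mask that marks one traced pixel in every 2x2 block. The abstract does not specify the sampling pattern, so both the 2x2 layout and the per-frame rotation of the traced pixel are assumptions made for this example, and `quarter_spp_mask` is a hypothetical name.

```python
from typing import List

# Position of the traced pixel inside each 2x2 block, cycled once per frame
# (an assumed pattern, chosen so that four frames cover every pixel).
BLOCK_OFFSETS = [(0, 0), (0, 1), (1, 0), (1, 1)]


def quarter_spp_mask(height: int, width: int, frame_index: int = 0) -> List[List[int]]:
    """Return a height x width mask with 1 where a ray is traced, else 0."""
    dy, dx = BLOCK_OFFSETS[frame_index % 4]
    return [[1 if (y % 2 == dy and x % 2 == dx) else 0 for x in range(width)]
            for y in range(height)]


def traced_fraction(mask: List[List[int]]) -> float:
    """Average number of traced samples per pixel."""
    total = sum(len(row) for row in mask)
    return sum(sum(row) for row in mask) / total


mask = quarter_spp_mask(4, 4, frame_index=0)
print(traced_fraction(mask))

# Union of four consecutive frames under the assumed rotation.
covered = [[0] * 4 for _ in range(4)]
for frame in range(4):
    frame_mask = quarter_spp_mask(4, 4, frame)
    for y in range(4):
        for x in range(4):
            covered[y][x] |= frame_mask[y][x]
```

The untraced pixels are the ones a reconstruction network would have to interpolate, while the traced ones carry exact values at the target resolution.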
Temporal sentence grounding (TSG) aims to identify the temporal boundary of a specific segment in an untrimmed video, given a sentence query. All existing works first use a sparse sampling strategy to extract a fixed number of video frames and then conduct multi-modal interactions with the query sentence for reasoning. However, we argue that these methods have overlooked two important issues: 1) Boundary bias: the annotated target segment generally refers to two specific frames as the corresponding start and end timestamps. The video downsampling process may lose these two frames and take adjacent irrelevant frames as the new boundaries. 2) Reasoning bias: such incorrect new boundary frames also lead to reasoning bias during frame-query interaction, reducing the generalization ability of the model. To alleviate the above limitations, in this paper we propose a novel Siamese Sampling and Reasoning Network (SSRN) for TSG, which introduces a siamese sampling mechanism to generate additional contextual frames that enrich and refine the new boundaries. Specifically, a reasoning strategy is developed to learn the inter-relationship among these frames and generate soft labels on boundaries for more accurate frame-query reasoning. This mechanism can also supplement the sampled sparse frames with the missing consecutive visual semantics for fine-grained activity understanding. Extensive experiments demonstrate the effectiveness of SSRN on three challenging datasets.
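The boundary bias described in this abstract can be reproduced with a few lines of arithmetic. The sketch below uniformly samples a fixed number of frames and snaps an annotated start and end frame to the nearest sampled frames; the video length, sample count, annotation, and function names are made-up values for illustration, and the sketch does not implement SSRN's siamese sampling.

```python
from typing import List, Tuple


def uniform_sample_indices(num_frames: int, num_samples: int) -> List[int]:
    """Pick num_samples frame indices spread evenly over [0, num_frames - 1]."""
    if num_samples < 1 or num_frames < 1:
        raise ValueError("num_frames and num_samples must be positive")
    if num_samples == 1:
        return [0]
    step = (num_frames - 1) / (num_samples - 1)
    return [round(i * step) for i in range(num_samples)]


def snapped_boundaries(start: int, end: int, sampled: List[int]) -> Tuple[int, int]:
    """Nearest sampled frames to the annotated start and end frames."""
    def nearest(t: int) -> int:
        return min(sampled, key=lambda s: abs(s - t))
    return nearest(start), nearest(end)


# Hypothetical video of 100 frames, sparsely sampled down to 10 frames.
sampled = uniform_sample_indices(num_frames=100, num_samples=10)
# Hypothetical annotation: the target segment spans frames 37 to 58.
new_start, new_end = snapped_boundaries(37, 58, sampled)
print(sampled, (new_start, new_end))
```

Neither annotated boundary frame survives the sampling in this example, so the model would be trained against shifted boundaries.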
Representing and synthesizing novel views in real-world dynamic scenes from casual monocular videos is a long-standing problem. Existing solutions typically approach dynamic scenes by applying geometry techniques or by using temporal information between several adjacent frames, without considering the underlying background distribution in the entire scene or the transmittance over the ray dimension, which limits their performance in static and occluded areas. Our approach, $\textbf{D}$istribution-$\textbf{D}$riven neural radiance fields, offers high-quality view synthesis and a 3D solution to $\textbf{D}$etach the background from the entire $\textbf{D}$ynamic scene, and is called $\text{D}^4$NeRF. Specifically, it employs a neural representation to capture the scene distribution in the static background and a 6D-input NeRF to represent dynamic objects. Each ray sample is given an additional occlusion weight to indicate the transmittance lying in the static and dynamic components. We evaluate $\text{D}^4$NeRF on public dynamic scenes and on our urban driving scenes acquired from an autonomous-driving dataset. Extensive experiments demonstrate that our approach outperforms previous methods in rendering texture details and motion areas while also producing a clean static background. Our code will be released at https://github.com/Luciferbobo/D4NeRF.