Compared to typical multi-sensor systems, monocular 3D object detection has attracted much attention due to its simple configuration. However, there is still a significant gap between LiDAR-based and monocular-based methods. In this paper, we find that the ill-posed nature of monocular imagery can lead to depth ambiguity. Specifically, objects at different depths can appear with the same bounding boxes and similar visual features in the 2D image. The network cannot accurately distinguish different depths from such non-discriminative visual features, which results in unstable depth training. To facilitate depth learning, we propose a simple yet effective plug-and-play module, One Bounding Box Multiple Objects (OBMO). Concretely, we add a set of suitable pseudo labels by shifting the 3D bounding box along the viewing frustum. To constrain the pseudo-3D labels to be reasonable, we carefully design two label scoring strategies to represent their quality. In contrast to the original hard depth labels, such soft pseudo labels with quality scores allow the network to learn a reasonable depth range, boosting training stability and thus improving final performance. Extensive experiments on the KITTI and Waymo benchmarks show that our method improves state-of-the-art monocular 3D detectors by a significant margin (the improvements under the moderate setting on the KITTI validation set are $\mathbf{1.82\sim 10.91\%}$ mAP in BEV and $\mathbf{1.18\sim 9.36\%}$ mAP in 3D). Code has been released at https://github.com/mrsempress/OBMO.
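As a rough illustration of the frustum-shifting idea in the OBMO abstract, the sketch below moves a 3D box center along the camera ray that passes through it, so the projected 2D location stays the same while the depth changes. The pinhole intrinsics (`f`, `cx`, `cy`), the depth offsets, and the `linear_score` function are illustrative assumptions, not the authors' released implementation or their exact scoring strategies.

```python
import numpy as np

def shift_along_ray(center, depth_offsets):
    """Move a 3D box center (x, y, z) along the camera ray through it.

    Scaling the whole point by new_z / old_z keeps x/z and y/z fixed,
    so the pinhole projection of the center does not change.
    """
    center = np.asarray(center, dtype=float)
    offsets = np.asarray(depth_offsets, dtype=float)
    scale = (center[2] + offsets) / center[2]
    return scale[:, None] * center[None, :]

def linear_score(center, depth_offsets, max_rel_shift=0.1):
    """Hypothetical quality score: 1 at zero shift, falling linearly to 0
    when the relative depth shift reaches max_rel_shift."""
    rel = np.abs(np.asarray(depth_offsets, dtype=float)) / center[2]
    return np.clip(1.0 - rel / max_rel_shift, 0.0, 1.0)

def project(points, f=700.0, cx=600.0, cy=180.0):
    """Pinhole projection with assumed intrinsics (illustrative values)."""
    points = np.atleast_2d(points)
    u = f * points[:, 0] / points[:, 2] + cx
    v = f * points[:, 1] / points[:, 2] + cy
    return np.stack([u, v], axis=1)

center = np.array([2.0, 1.5, 20.0])      # original label, camera coordinates
offsets = np.array([-1.0, 0.0, 1.0])     # depth shifts in metres
pseudo = shift_along_ray(center, offsets)
scores = linear_score(center, offsets)

for p, s in zip(pseudo, scores):
    print(p, project(p)[0], s)
```

All three pseudo centers project to the same pixel, which is the depth ambiguity the abstract describes; the score simply down-weights labels that lie farther from the annotated depth.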
translated by 谷歌翻译
As the basis for prehensile manipulation, it is vital to enable robots to grasp as robustly as humans. In daily manipulation, our grasping system is prompt, accurate, flexible and continuous across spatial and temporal domains. Few existing methods cover all these properties for robot grasping. In this paper, we propose a new methodology for grasp perception that gives robots these abilities. Specifically, we develop a dense supervision strategy with real perception and analytic labels in the spatial-temporal domain. Additional awareness of objects' center-of-mass is incorporated into the learning process to help improve grasping stability. Utilization of grasp correspondence across observations enables dynamic grasp tracking. Our model, AnyGrasp, can generate accurate, full-DoF, dense and temporally-smooth grasp poses efficiently, and works robustly against large depth sensing noise. Embedded with AnyGrasp, we achieve a 93.3% success rate when clearing bins with over 300 unseen objects, which is comparable with human subjects under controlled conditions. Over 900 MPPH is reported on a single-arm system. For dynamic grasping, we demonstrate catching swimming robot fish in the water.
Large language models (LLMs) have been shown to be able to perform new tasks based on a few demonstrations or natural language instructions. While these capabilities have led to widespread adoption, most LLMs are developed by resource-rich organizations and are frequently kept from the public. As a step towards democratizing this powerful technology, we present BLOOM, a 176B-parameter open-access language model designed and built thanks to a collaboration of hundreds of researchers. BLOOM is a decoder-only Transformer language model that was trained on the ROOTS corpus, a dataset comprising hundreds of sources in 46 natural and 13 programming languages (59 in total). We find that BLOOM achieves competitive performance on a wide variety of benchmarks, with stronger results after undergoing multitask prompted finetuning. To facilitate future research and applications using LLMs, we publicly release our models and code under the Responsible AI License.
3D point cloud registration is a fundamental problem in computer vision and robotics. Recently, learning-based point cloud registration methods have made great progress. However, these methods are sensitive to outliers, which lead to more incorrect correspondences. In this paper, we propose a novel deep graph matching-based framework for point cloud registration. Specifically, we first transform point clouds into graphs and extract deep features for each point. Then, we develop a module based on deep graph matching to calculate a soft correspondence matrix. By using graph matching, not only the local geometry of each point but also its structure and topology in a larger range are considered in establishing correspondences, so that more correct correspondences are found. We train the network with a loss directly defined on the correspondences, and in the test stage the soft correspondences are transformed into hard one-to-one correspondences so that registration can be performed by a correspondence-based solver. Furthermore, we introduce a transformer-based method to generate edges for graph construction, which further improves the quality of the correspondences. Extensive experiments on object-level and scene-level benchmark datasets show that the proposed method achieves state-of-the-art performance. The code is available at: https://github.com/fukexue/RGM.
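The test-stage step described above (turning a soft correspondence matrix into hard one-to-one matches, then solving for the rigid transform) can be sketched as follows. This is a minimal illustration under assumptions: it uses the Hungarian algorithm from SciPy for the assignment and an SVD-based (Kabsch) least-squares solver, and the soft matrix is simulated rather than predicted by a network. The abstract does not say which assignment method or solver the authors use.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def harden(soft):
    """Pick the one-to-one assignment that maximises the total soft score."""
    rows, cols = linear_sum_assignment(soft, maximize=True)
    return rows, cols

def kabsch(src, dst):
    """Least-squares rigid transform (R, t) such that dst ~= src @ R.T + t."""
    src_c, dst_c = src.mean(axis=0), dst.mean(axis=0)
    H = (src - src_c).T @ (dst - dst_c)          # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))       # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = dst_c - R @ src_c
    return R, t

rng = np.random.default_rng(0)
src = rng.normal(size=(6, 3))

# Ground-truth motion: 30 degree rotation about z plus a translation.
a = np.deg2rad(30.0)
R_true = np.array([[np.cos(a), -np.sin(a), 0.0],
                   [np.sin(a),  np.cos(a), 0.0],
                   [0.0,        0.0,       1.0]])
t_true = np.array([0.5, -0.2, 1.0])

perm = rng.permutation(6)                         # dst[j] comes from src[perm[j]]
dst = (src @ R_true.T + t_true)[perm]

# Simulated soft correspondence matrix: noisy, with high mass on true pairs.
soft = 0.1 * rng.random((6, 6))
soft[perm, np.arange(6)] += 1.0
soft /= soft.sum(axis=1, keepdims=True)           # row-normalise

rows, cols = harden(soft)
R_est, t_est = kabsch(src[rows], dst[cols])
print(np.round(R_est, 3), np.round(t_est, 3))
```

With clean correspondences the recovered transform matches the ground truth; with outliers, the quality of the hard assignment is what determines whether the solver succeeds.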
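Convolutional neural networks (CNNs) have been widely used in many computer vision tasks. However, CNNs have a fixed receptive field and lack the long-range perception ability that is crucial for human pose estimation. Because it can capture long-range dependencies between pixels, the transformer architecture has recently been adopted for computer vision applications and has proven to be highly effective. We are interested in exploring its capability in human pose estimation, and therefore propose a novel model based on the transformer architecture, enhanced with a feature pyramid fusion structure. More specifically, we use a pre-trained Swin Transformer as the backbone to extract features from input images, and we use a feature pyramid structure to extract feature maps from its different stages. By fusing these features together, our model predicts the keypoint heatmaps. The experimental results of our study show that the proposed transformer-based model achieves better performance than state-of-the-art CNN-based models.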
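Semantic segmentation of remotely sensed urban scene images is required in a wide range of practical applications, such as land cover mapping, urban change detection, environmental protection, and economic assessment. Driven by rapid developments in deep learning, the convolutional neural network (CNN) has dominated semantic segmentation for many years. CNNs adopt a hierarchical feature representation and show strong capability for local information extraction. However, the local property of the convolution layer limits the network's ability to capture global context. Recently, as a hot topic in computer vision, the Transformer has demonstrated great potential in global information modelling, boosting many vision-related tasks such as image classification, object detection, and particularly semantic segmentation. In this paper, we propose a Transformer-based decoder and construct a UNet-like Transformer (UNetFormer) for real-time urban scene segmentation. For efficient segmentation, UNetFormer selects the lightweight ResNet18 as the encoder and develops an efficient global-local attention mechanism to model both global and local information in the decoder. Extensive experiments show that our method not only runs faster but also achieves higher accuracy than state-of-the-art lightweight models. Specifically, the proposed UNetFormer achieves 67.8% and 52.4% mIoU on the UAVid and LoveDA datasets, respectively, while the inference speed reaches up to 322.4 FPS with a 512x512 input on a single NVIDIA GTX 3090 GPU. In further exploration, the proposed Transformer-based decoder combined with a Swin Transformer encoder also achieves state-of-the-art results (91.3% F1 and 84.1% mIoU) on the Vaihingen dataset. The source code will be freely available at https://github.com/wanglibo1995/geoseg.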
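The mIoU figures quoted above are mean intersection-over-union scores. As a reminder of how this metric is typically computed, here is a minimal NumPy sketch built on a confusion matrix. The toy label maps and the choice to skip classes absent from both prediction and ground truth are assumptions; benchmark evaluation protocols may differ in which classes they ignore.

```python
import numpy as np

def confusion_matrix(pred, target, num_classes):
    """Rows index the ground-truth class, columns the predicted class."""
    pred = np.asarray(pred).ravel()
    target = np.asarray(target).ravel()
    idx = target * num_classes + pred
    counts = np.bincount(idx, minlength=num_classes ** 2)
    return counts.reshape(num_classes, num_classes)

def mean_iou(pred, target, num_classes):
    cm = confusion_matrix(pred, target, num_classes)
    tp = np.diag(cm).astype(float)
    union = cm.sum(axis=0) + cm.sum(axis=1) - tp   # predicted + actual - overlap
    valid = union > 0                              # skip classes that never occur
    return float((tp[valid] / union[valid]).mean())

# Toy 2x3 label maps with three classes.
target = np.array([[0, 0, 1],
                   [1, 2, 2]])
pred = np.array([[0, 1, 1],
                 [1, 2, 0]])

print(mean_iou(pred, target, num_classes=3))
```

Per-class IoUs here are 1/3, 2/3 and 1/2, so the mean is 0.5.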
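The fully convolutional network (FCN) with an encoder-decoder architecture is the standard paradigm for semantic segmentation. The encoder-decoder architecture uses an encoder to capture multi-level feature maps, which the decoder incorporates into the final prediction. As context is crucial for precise segmentation, tremendous effort has been made to extract such information in an intelligent fashion, including employing dilated/atrous convolutions or inserting attention modules. However, these efforts are all based on the FCN architecture with ResNet or other backbones, which by design cannot fully exploit context. By contrast, we propose using the Swin Transformer as the backbone to extract context information and design a novel decoder, the densely connected feature aggregation module (DCFAM), to restore the resolution and produce the segmentation map. Experimental results on two remote sensing semantic segmentation datasets demonstrate the effectiveness of the proposed scheme.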
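Assigning geospatial objects to specific categories at the pixel level is a fundamental task in remote sensing image analysis. With the rapid development of sensor technologies, remotely sensed images can be captured at multiple spatial resolutions (MSR), with information content manifested at different scales. Extracting information from these MSR images represents a huge opportunity for enhanced feature representation and characterisation. However, MSR images suffer from two critical issues: 1) scale variation of geo-objects and 2) loss of detailed information at coarse spatial resolutions. To bridge these gaps, in this paper we propose a novel scale-aware neural network (SaNet) for semantic segmentation of MSR remotely sensed imagery. SaNet deploys a densely connected feature network (DCFFM) module to capture high-quality multi-scale context, so that scale variation is handled properly and segmentation quality increases for both large and small objects. A spatial feature recalibration (SFRM) module is further incorporated into the network to learn intact semantic content with enhanced spatial relationships, removing the negative effects of information loss. The combination of DCFFM and SFRM allows SaNet to learn a scale-aware feature representation that outperforms existing multi-scale feature representations. Extensive experiments on three semantic segmentation datasets demonstrate the effectiveness of the proposed SaNet in cross-resolution segmentation.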
Cashews are grown by over 3 million smallholders in more than 40 countries worldwide as a principal source of income. As the third largest cashew producer in Africa, Benin has nearly 200,000 smallholder cashew growers contributing 15% of the country's national export earnings. However, a lack of information on where and how cashew trees grow across the country hinders decision-making that could support increased cashew production and poverty alleviation. By leveraging 2.4-m Planet Basemaps and 0.5-m aerial imagery, newly developed deep learning algorithms, and large-scale ground truth datasets, we successfully produced the first national map of cashew in Benin and characterized the expansion of cashew plantations between 2015 and 2021. In particular, we developed a SpatioTemporal Classification with Attention (STCA) model to map the distribution of cashew plantations, which can fully capture texture information from discriminative time steps during a growing season. We further developed a Clustering Augmented Self-supervised Temporal Classification (CASTC) model to distinguish high-density versus low-density cashew plantations by automatic feature extraction and optimized clustering. Results show that the STCA model achieved an overall accuracy of 80% and the CASTC model achieved an overall accuracy of 77.9%. We found that the cashew area in Benin doubled from 2015 to 2021, with 60% of new plantation development coming from cropland or fallow land, while encroachment of cashew plantations into protected areas increased by 70%. Only half of cashew plantations were high-density in 2021, suggesting high potential for intensification. Our study illustrates the power of combining high-resolution remote sensing imagery and state-of-the-art deep learning algorithms to better understand tree crops in the heterogeneous smallholder landscape.
Visual language such as charts and plots is ubiquitous in the human world. Comprehending plots and charts requires strong reasoning skills. Prior state-of-the-art (SOTA) models require at least tens of thousands of training examples and their reasoning capabilities are still quite limited, especially on complex human-written queries. This paper presents the first one-shot solution to visual language reasoning. We decompose the challenge of visual language reasoning into two steps: (1) plot-to-text translation, and (2) reasoning over the translated text. The key to this method is a modality conversion module, named DePlot, which translates the image of a plot or chart into a linearized table. The output of DePlot can then be directly used to prompt a pretrained large language model (LLM), exploiting the few-shot reasoning capabilities of LLMs. To obtain DePlot, we standardize the plot-to-table task by establishing unified task formats and metrics, and train DePlot end-to-end on this task. DePlot can then be used off-the-shelf together with LLMs in a plug-and-play fashion. Compared with a SOTA model finetuned on more than 28k data points, DePlot+LLM with just one-shot prompting achieves a 24.0% improvement over finetuned SOTA on human-written queries from the task of chart QA.
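To make the second step of this pipeline concrete, the sketch below shows how a linearized table might be combined with a single demonstration to form a one-shot prompt for an LLM. The separator format, the prompt wording, and the helper names `linearize_table` and `build_one_shot_prompt` are hypothetical; the abstract does not specify DePlot's actual output format or the prompt template, and no model is called here.

```python
def linearize_table(header, rows):
    """Flatten a table into text: one line per row, cells joined by ' | '."""
    lines = [" | ".join(header)]
    lines += [" | ".join(str(cell) for cell in row) for row in rows]
    return "\n".join(lines)

def build_one_shot_prompt(demo, table_text, question):
    """Prepend one worked (table, question, answer) example to the new query."""
    demo_table, demo_question, demo_answer = demo
    return (
        "Read the table and answer the question.\n\n"
        f"Table:\n{demo_table}\nQuestion: {demo_question}\nAnswer: {demo_answer}\n\n"
        f"Table:\n{table_text}\nQuestion: {question}\nAnswer:"
    )

# One demonstration (the "one shot").
demo_table = linearize_table(["City", "Population"], [["A", 100], ["B", 250]])
demo = (demo_table, "Which city has the larger population?", "B")

# The table a plot-to-table model would be expected to produce for a new chart.
table = linearize_table(["Year", "Sales"], [[2020, 10], [2021, 15]])
prompt = build_one_shot_prompt(demo, table, "How much did sales grow from 2020 to 2021?")
print(prompt)
```

The resulting string would be passed to whichever LLM is used for reasoning; because the chart has already been converted to text, the model needs no visual input.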