In recent years, vision-centric perception has flourished in various autonomous driving tasks, including 3D detection, semantic map construction, motion forecasting, and depth estimation. Nevertheless, the latency of vision-centric approaches is too high for practical deployment (e.g., most camera-based 3D detectors have a runtime greater than 300 ms). To bridge the gap between idealized research and real-world applications, it is necessary to quantify the trade-off between performance and efficiency. Traditionally, autonomous-driving perception benchmarks perform offline evaluation, neglecting the inference-time delay. To mitigate this problem, we propose the Autonomous-driving StreAming Perception (ASAP) benchmark, the first benchmark to evaluate the online performance of vision-centric perception in autonomous driving. On the basis of the 2Hz annotated nuScenes dataset, we first propose an annotation-extending pipeline to generate high-frame-rate labels for the 12Hz raw images. With reference to practical deployment, we further construct the Streaming Perception Under constRained-computation (SPUR) evaluation protocol, where the 12Hz inputs are used for streaming evaluation under different computational-resource constraints. In the ASAP benchmark, comprehensive experimental results reveal that the model ranking changes under different constraints, suggesting that model latency and computation budget should be treated as design choices when optimizing for practical deployment. To facilitate further research, we establish baselines for camera-based streaming 3D detection, which consistently improve streaming performance across various hardware. ASAP project page: https://github.com/JeffWang987/ASAP.
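To make the streaming protocol concrete, here is a minimal Python sketch of latency-aware matching in the spirit of the SPUR protocol described above: each prediction is scored against the annotation that is current when the prediction is emitted, not when its input frame arrived. The `frames`, `annotations`, and `model` objects are hypothetical placeholders; this is not the official ASAP evaluation code.

```python
# A minimal sketch of latency-aware (streaming) matching, assuming
# hypothetical `frames` [(timestamp, image), ...], 12Hz `annotations`
# [{"timestamp": ..., ...}, ...], and a callable `model`.
import time

def streaming_eval_pairs(model, frames, annotations):
    """Pair each prediction with the ground truth that is current when
    the prediction becomes available (input time + measured latency)."""
    pairs = []
    for t_input, image in frames:
        t0 = time.perf_counter()
        prediction = model(image)          # run under the real compute budget
        latency = time.perf_counter() - t0
        t_ready = t_input + latency        # the world moved on while we computed
        # Score against the annotation closest to the emission time.
        gt = min(annotations, key=lambda a: abs(a["timestamp"] - t_ready))
        pairs.append((prediction, gt))
    return pairs                           # feed into standard metrics (e.g., mAP/NDS)
```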
Large language models (LLMs) have been shown to be able to perform new tasks based on a few demonstrations or natural language instructions. While these capabilities have led to widespread adoption, most LLMs are developed by resource-rich organizations and are frequently kept from the public. As a step towards democratizing this powerful technology, we present BLOOM, a 176B-parameter open-access language model designed and built thanks to a collaboration of hundreds of researchers. BLOOM is a decoder-only Transformer language model that was trained on the ROOTS corpus, a dataset comprising hundreds of sources in 46 natural and 13 programming languages (59 in total). We find that BLOOM achieves competitive performance on a wide variety of benchmarks, with stronger results after undergoing multitask prompted finetuning. To facilitate future research and applications using LLMs, we publicly release our models and code under the Responsible AI License.
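Since the abstract highlights the public release of the model, a hedged usage sketch follows, showing how the released checkpoint can be loaded with the Hugging Face transformers library. The snippet is illustrative and not taken from the paper; the full 176B checkpoint requires substantial GPU memory, and smaller released variants follow the same pattern.

```python
# A hedged usage sketch (not from the paper): loading the released BLOOM
# checkpoint with Hugging Face `transformers`. device_map="auto" assumes
# the `accelerate` package is installed.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom")
model = AutoModelForCausalLM.from_pretrained("bigscience/bloom", device_map="auto")

inputs = tokenizer("A poem about open science:", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=30)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```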
Desnowing a single image is a common yet challenging task. Complex snow degradations at diverse scales demand strong representation ability. To let the desnowing network see diverse snow degradations and model the context interaction between local details and global information, we propose a powerful architecture called SnowFormer. First, it performs scale-aware feature aggregation in the encoder to capture rich snow information across various degradations. Second, to handle large-scale degradations, it uses a novel context-interaction transformer block in the decoder, which performs global context interaction between local details and the global information from earlier scales; the introduced local context interaction also improves the recovery of scene details. Third, we design a heterogeneous feature projection head that progressively fuses features from the encoder and decoder and projects the refined features into a clean image. Extensive experiments show that the proposed SnowFormer achieves significant improvements over other SOTA methods. Compared with the SOTA single-image desnowing method HDCW-Net, it improves the PSNR metric by 9.2 dB on the CSD test set. Moreover, it boosts PSNR by 5.13 dB over the general image restoration architecture NAFNet, which verifies the strong representation ability of SnowFormer on the desnowing task. The code is released at https://github.com/ephemeral182/snowformer.
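The first component, scale-aware feature aggregation in the encoder, can be sketched as below: pool the features at several scales, upsample back, and fuse. All sizes and the module itself are simplified placeholders, not the authors' implementation.

```python
# A simplified PyTorch sketch of scale-aware feature aggregation:
# capture snow statistics at several scales and fuse them.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ScaleAwareAggregation(nn.Module):
    def __init__(self, dim, scales=(1, 2, 4)):
        super().__init__()
        self.scales = scales
        self.fuse = nn.Conv2d(dim * len(scales), dim, kernel_size=1)

    def forward(self, x):                  # x: (B, dim, H, W)
        h, w = x.shape[-2:]
        feats = [F.adaptive_avg_pool2d(x, (max(h // s, 1), max(w // s, 1)))
                 for s in self.scales]     # features at several scales
        feats = [F.interpolate(f, size=(h, w), mode="bilinear", align_corners=False)
                 for f in feats]
        return self.fuse(torch.cat(feats, dim=1))
```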
Self-supervised monocular methods can efficiently learn depth information for weakly textured surfaces or reflective objects. However, due to the inherent ambiguity of monocular geometric modeling, the achievable depth accuracy is limited. In contrast, multi-frame depth estimation methods improve depth accuracy thanks to the success of Multi-View Stereo (MVS), which directly exploits geometric constraints. Unfortunately, MVS often suffers from textureless regions, non-Lambertian surfaces, and moving objects, especially in real-world video sequences without known camera motion or depth supervision. Therefore, we propose MOVEDepth, which exploits monocular cues and velocity guidance to improve multi-frame depth learning. Unlike existing methods that enforce consistency between MVS depth and monocular depth, MOVEDepth boosts multi-frame depth learning by directly addressing the inherent problems of MVS. The key of our approach is to use monocular depth as a geometric prior to construct the MVS cost volume, and to adjust the depth candidates of the cost volume under the guidance of the predicted camera velocity. We further fuse monocular depth and MVS depth by learning the uncertainty of the cost volume, which yields depth estimation that is robust to the ambiguity of multi-view geometry. Extensive experiments show that MOVEDepth achieves state-of-the-art performance: compared with Monodepth2 and PackNet, our method relatively improves depth accuracy by 20% and 19.8% on the KITTI benchmark. MOVEDepth also generalizes to the more challenging DDAD benchmark, with a relative improvement of 7.2%. The code is available at https://github.com/jeffwang987/movedepth.
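The core mechanism, centering the MVS depth candidates on the monocular prior and adjusting the search range with predicted velocity, can be sketched as follows. The function name and the range heuristic are illustrative assumptions, not the paper's exact formulation.

```python
# A hedged sketch: faster ego-motion implies larger parallax, so the
# candidate range around the monocular prior is widened accordingly.
import torch

def velocity_guided_candidates(mono_depth, velocity, num_candidates=16, base_ratio=0.1):
    """mono_depth: (B, 1, H, W) monocular prior; velocity: (B,) speed estimate."""
    ratio = base_ratio * (1.0 + velocity).view(-1, 1, 1, 1)   # assumed heuristic
    d_min = mono_depth * (1.0 - ratio)
    d_max = mono_depth * (1.0 + ratio)
    steps = torch.linspace(0.0, 1.0, num_candidates, device=mono_depth.device)
    # (B, num_candidates, H, W): per-pixel candidates around the prior,
    # from which the MVS cost volume is built.
    return d_min + (d_max - d_min) * steps.view(1, -1, 1, 1)
```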
Restoring images of snowy scenes in severe weather is a difficult task. Snow images suffer complex degradations that are scattered over the clean image and change its distribution. Previous CNN-based methods struggle to fully restore snowy scenes because they lack specific global modeling ability. In this paper, we apply a vision transformer to the task of removing snow from a single image. Specifically, we first propose a parallel network architecture split along the channel dimension, which performs local feature refinement and global information modeling separately, and we use a channel shuffle operation to combine their respective strengths and enhance network performance. Second, we propose the MSP module, which uses multi-scale average pooling to aggregate information at different sizes and simultaneously performs multi-scale projection self-attention on multi-head self-attention, improving the representation ability of the model under different scales of degradation. Finally, we design a lightweight and simple local capture module to refine the model's local capture capability. In the experimental section, we conduct extensive experiments to demonstrate the superiority of our method, comparing against previous snow removal methods on three snow-scene datasets. The results show that our method surpasses the state-of-the-art methods with fewer parameters and less computation. On the CSD test dataset, we achieve substantial gains of 1.99 dB in PSNR and 0.03 in SSIM. On the SRRS and Snow100K datasets, we also improve on the Transweather method by 2.47 dB and 1.62 dB in PSNR, and by 0.03 in SSIM. In the visual comparisons, our MSP-Former achieves better visual quality than existing methods, demonstrating the practicality of our approach.
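The MSP idea, average-pooling the feature map at several scales and attending to the pooled maps, might look roughly like the sketch below. Head count and layer sizes are assumptions (the embedding dimension must be divisible by the number of heads).

```python
# A simplified sketch: multi-scale pooled tokens serve as compact
# keys/values for attention over the full-resolution queries.
import torch
import torch.nn as nn

class MSPAttention(nn.Module):
    def __init__(self, dim, pool_sizes=(1, 2, 4), num_heads=4):
        super().__init__()
        self.pools = nn.ModuleList(nn.AdaptiveAvgPool2d(p) for p in pool_sizes)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x):                              # x: (B, C, H, W)
        b, c, h, w = x.shape
        q = x.flatten(2).transpose(1, 2)               # (B, H*W, C) queries
        kv = torch.cat([p(x).flatten(2).transpose(1, 2) for p in self.pools],
                       dim=1)                          # multi-scale pooled tokens
        out, _ = self.attn(q, kv, kv)
        return out.transpose(1, 2).reshape(b, c, h, w)
```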
In winter scenes, the degradation of images captured under snow can be very complex, and the spatial distribution of the snow degradation varies from image to image. Recent methods adopt deep neural networks to directly recover clean scenes from snowy images. However, because complex snow degradations vary so widely, achieving reliable high-definition desnowing in real time is a great challenge. We develop a novel, efficient pyramid network with an asymmetrical encoder-decoder architecture for real-time HD image desnowing. The general idea of our network is to fully exploit multi-scale feature flows across the features. Compared with previous state-of-the-art methods, our approach achieves a better complexity-performance trade-off and effectively handles the processing difficulties of HD and Ultra-HD images. Extensive experiments on three large-scale image desnowing datasets show that our method surpasses all state-of-the-art approaches by a large margin, both quantitatively and qualitatively, boosting the PSNR metric from 31.76 dB to 34.10 dB on the CSD test dataset, and from 28.29 dB to 30.87 dB on the SRRS test dataset.
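An asymmetrical encoder-decoder of the kind described, with a heavier encoder and a deliberately light decoder fusing multi-scale features, could be sketched as below. Channel widths and depths are illustrative choices, not the paper's configuration.

```python
# A rough sketch: heavy multi-scale encoder, light fusion decoder,
# residual prediction over the snowy input.
import torch
import torch.nn as nn

def conv_block(cin, cout, n):          # n stacked 3x3 conv + ReLU layers
    layers = []
    for i in range(n):
        layers += [nn.Conv2d(cin if i == 0 else cout, cout, 3, padding=1),
                   nn.ReLU(inplace=True)]
    return nn.Sequential(*layers)

class AsymmetricPyramid(nn.Module):
    def __init__(self):
        super().__init__()
        self.enc1 = conv_block(3, 32, 3)       # heavy encoder stages
        self.enc2 = conv_block(32, 64, 3)
        self.dec = conv_block(32 + 64, 32, 1)  # single light decoder stage
        self.out = nn.Conv2d(32, 3, 3, padding=1)
        self.down = nn.MaxPool2d(2)
        self.up = nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False)

    def forward(self, x):                      # x: (B, 3, H, W), H and W even
        f1 = self.enc1(x)                      # full resolution
        f2 = self.enc2(self.down(f1))          # half resolution
        fused = self.dec(torch.cat([f1, self.up(f2)], dim=1))
        return self.out(fused) + x             # residual: snow-free estimate
```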
Layout planning is centrally important in the fields of architecture and urban design. Among the various basic units that carry urban functions, the residential community plays a vital role in supporting human life. Layout planning of residential communities has therefore long been of concern, and has attracted particular attention since the advent of deep learning, which facilitates automated layout generation and spatial pattern recognition. However, the research community generally suffers from a lack of residential-community layout benchmarks and high-quality datasets, which hampers future exploration of data-driven methods for residential community layout planning. The lack of datasets is largely due to the difficulty of acquiring large-scale real-world residential data and the long-term expert screening it requires. To address these issues and to advance benchmark datasets for various intelligent spatial design and analysis applications in smart-city development, we introduce ReCo, a residential community layout planning dataset built from real-world communities. The ReCo dataset is presented in multiple data formats and contains 37,646 residential community layout plans covering 598,728 residential buildings with height information. It can be conveniently adapted to urban design tasks related to residential community layout, such as generative layout design, morphological pattern recognition, and spatial evaluation. To validate the utility of ReCo in automated residential community layout planning, a generative model based on Generative Adversarial Networks (GANs) is further applied to the dataset. We hope the ReCo dataset can inspire more creative and practical work on intelligent design and beyond. The ReCo dataset is published at: https://www.kaggle.com/fdudsde/reco-dataset.
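As a rough illustration of the kind of GAN baseline the abstract mentions, the sketch below trains a generator on rasterized layouts. The representation, shapes, and losses are assumptions, not details from the ReCo paper.

```python
# An illustrative GAN training step: a generator maps noise to a
# rasterized 64x64 layout map; a discriminator scores realism.
import torch
import torch.nn as nn

G = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 64 * 64), nn.Tanh())
D = nn.Sequential(nn.Linear(64 * 64, 256), nn.LeakyReLU(0.2), nn.Linear(256, 1))
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

def train_step(real_layouts):                  # (B, 64*64) rasterized layouts
    b = real_layouts.size(0)
    fake = G(torch.randn(b, 128))
    # Discriminator: separate real from generated layouts.
    loss_d = (bce(D(real_layouts), torch.ones(b, 1))
              + bce(D(fake.detach()), torch.zeros(b, 1)))
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()
    # Generator: fool the discriminator.
    loss_g = bce(D(fake), torch.ones(b, 1))
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
```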
Autonomous driving perceives its surroundings for decision making, which is one of the most complex scenarios in visual perception. The success of paradigm innovation in solving the 2D object detection task inspires us to seek an elegant, feasible, and scalable paradigm for fundamentally pushing the performance boundary in this area. To this end, we contribute the BEVDet paradigm in this paper. BEVDet performs 3D object detection in Bird-Eye-View (BEV), where most target values are defined and route planning can be handily performed. We simply reuse existing modules to build its framework, but substantially develop its performance by constructing an exclusive data augmentation strategy and upgrading the Non-Maximum Suppression strategy. In experiments, BEVDet offers an excellent trade-off between accuracy and time efficiency. As a fast version, BEVDet-Tiny scores 31.2% mAP and 39.2% NDS on the nuScenes val set. It is comparable with FCOS3D, but requires just 11% of the computational budget at 215.3 GFLOPs, and runs 9.2 times faster at 15.6 FPS. Another high-precision version dubbed BEVDet-Base scores 39.3% mAP and 47.2% NDS, significantly exceeding all published results. With a comparable inference speed, it surpasses FCOS3D by large margins of +9.8% mAP and +10.0% NDS. The source code is publicly available for further research at https://github.com/huangjunjie2017/bevdet.
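The BEVDet paradigm decomposes into a staged pipeline, which can be captured in a thin skeleton like the one below. Every module is a placeholder to be filled with existing components, as the abstract describes; the decomposition is our reading of the paradigm, not the authors' code.

```python
# A thin skeleton of the BEVDet pipeline: image-view encoder ->
# view transformer (image features to BEV) -> BEV encoder -> detection head.
import torch.nn as nn

class BEVDetSketch(nn.Module):
    def __init__(self, img_encoder, view_transformer, bev_encoder, det_head):
        super().__init__()
        self.img_encoder = img_encoder            # e.g., a CNN over each camera image
        self.view_transformer = view_transformer  # lifts image features into the BEV grid
        self.bev_encoder = bev_encoder            # convolutions over the BEV grid
        self.det_head = det_head                  # predicts 3D boxes in BEV

    def forward(self, multi_cam_images, cam_params):
        feats = self.img_encoder(multi_cam_images)
        bev = self.view_transformer(feats, cam_params)
        return self.det_head(self.bev_encoder(bev))
```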
Modern machine learning suffers from catastrophic forgetting when learning new classes incrementally: performance degrades dramatically due to the missing data of old classes. Incremental learning methods have been proposed to retain the knowledge acquired from the old classes, by using knowledge distillation and keeping a few exemplars from the old classes. However, these methods struggle to scale up to a large number of classes. We believe this is because of the combination of two factors: (a) the data imbalance between the old and new classes, and (b) the increasing number of visually similar classes. Distinguishing between an increasing number of visually similar classes is particularly challenging when the training data is unbalanced. We propose a simple and effective method to address this data imbalance issue. We found that the last fully connected layer has a strong bias towards the new classes, and that this bias can be corrected by a linear model. With two bias parameters, our method performs remarkably well on two large datasets: ImageNet (1000 classes) and MS-Celeb-1M (10000 classes), outperforming the state-of-the-art algorithms by 11.1% and 13.2%, respectively.
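The bias-correction layer itself is small enough to sketch directly: a scale and an offset, the two bias parameters, are learned after normal incremental training and applied only to the new classes' logits. The module below follows that description; the naming is ours.

```python
# A compact sketch of the described bias correction: `alpha` scales and
# `beta` offsets the new-class logits, leaving old-class logits untouched.
import torch
import torch.nn as nn

class BiasCorrection(nn.Module):
    def __init__(self, num_old):
        super().__init__()
        self.num_old = num_old
        self.alpha = nn.Parameter(torch.ones(1))   # scale for new-class logits
        self.beta = nn.Parameter(torch.zeros(1))   # offset for new-class logits

    def forward(self, logits):                     # logits: (B, num_old + num_new)
        old, new = logits[:, :self.num_old], logits[:, self.num_old:]
        return torch.cat([old, self.alpha * new + self.beta], dim=1)
```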
Dataset distillation has emerged as a prominent technique to improve data efficiency when training machine learning models. It encapsulates the knowledge from a large dataset into a smaller synthetic dataset. A model trained on this smaller distilled dataset can attain comparable performance to a model trained on the original training dataset. However, the existing dataset distillation techniques mainly aim at achieving the best trade-off between resource usage efficiency and model utility. The security risks stemming from them have not been explored. This study performs the first backdoor attack against the models trained on the data distilled by dataset distillation models in the image domain. Concretely, we inject triggers into the synthetic data during the distillation procedure rather than during the model training stage, where all previous attacks are performed. We propose two types of backdoor attacks, namely NAIVEATTACK and DOORPING. NAIVEATTACK simply adds triggers to the raw data at the initial distillation phase, while DOORPING iteratively updates the triggers during the entire distillation procedure. We conduct extensive evaluations on multiple datasets, architectures, and dataset distillation techniques. Empirical evaluation shows that NAIVEATTACK achieves decent attack success rate (ASR) scores in some cases, while DOORPING reaches higher ASR scores (close to 1.0) in all cases. Furthermore, we conduct a comprehensive ablation study to analyze the factors that may affect the attack performance. Finally, we evaluate multiple defense mechanisms against our backdoor attacks and show that our attacks can practically circumvent these defense mechanisms.
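NAIVEATTACK, as described, amounts to stamping a trigger onto a fraction of the raw data before distillation starts. A schematic sketch follows; the patch shape, position, and poison rate are illustrative choices, not the paper's configuration.

```python
# A schematic NAIVEATTACK sketch: add a trigger patch to some raw images
# and relabel them to the attacker's target class before distillation.
import torch

def naive_attack(images, labels, target_class, poison_rate=0.1, patch_size=3):
    """images: (N, C, H, W) in [0, 1]; labels: (N,). Returns a poisoned copy."""
    images, labels = images.clone(), labels.clone()
    n_poison = int(poison_rate * len(images))
    idx = torch.randperm(len(images))[:n_poison]
    images[idx, :, -patch_size:, -patch_size:] = 1.0   # white square, bottom-right
    labels[idx] = target_class                          # tie trigger to target class
    # The poisoned set is then distilled; the backdoor transfers to models
    # trained on the distilled data.
    return images, labels
```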