障碍物检测是机器人导航中的一个安全问题,即立体声匹配是一种流行的基于视觉的方法。尽管深度神经网络在计算机视觉中显示出令人印象深刻的结果,但以前的大多数障碍物检测都仅利用传统的立体声匹配技术来满足实时反馈的计算限制。本文提出了一种计算高效的方法,该方法利用深度神经网络直接从立体声图像中检测占用率。我们的方法没有从立体声数据中学习点云对应,而是根据体积表示提取紧凑的障碍物分布。此外,我们根据解码器产生的OCTREES以粗到1的方式修剪安全空间的计算。结果,我们在机载计算机上实现实时性能(NVIDIA JETSON TX2)。我们的方法可检测到32米的范围准确的障碍,并以最先进的立体声模型的计算成本的2%的计算成本获得了更好的IOU(相交)和CD(倒角距离)。此外,我们通过使用真实机器人进行自主导航实验来验证方法的鲁棒性和现实世界的可行性。因此,我们的工作有助于缩小机器人感知中基于立体声的系统与计算机视觉中最新的立体声模型之间的差距。为了应对高质量的现实世界立体声数据集的稀缺性,我们收集了一个1.36小时的立体声数据集,该数据集用jackal机器人来微调我们的模型。数据集,代码和更多可视化可在https://lhy.xyz/stereovoxelnet/上获得
translated by 谷歌翻译
视觉变压器(VIT)正在出现,并且在计算机视觉任务中的准确性显着提高。但是,它们的复杂架构和巨大的计算/存储需求对新硬件加速器设计方法施加了紧迫的需求。这项工作提出了基于提议的混合速度量化的FPGA感知自动VIT加速框架。据我们所知,这是探索模型量化的第一个基于FPGA的VIT加速框架。与最先进的VIT量化工作(仅无硬件加速的算法方法)相比,我们的量化在相同的位宽度下可实现0.47%至1.36%的TOP-1精度。与32位浮点基线FPGA加速器相比,我们的加速器在框架速率上的提高约为5.6倍(即56.8 fps vs. 10.0 fps),对于DeitBase的ImagEnet数据集,精度下降了0.71%。
translated by 谷歌翻译
最近,传输位更少的高质量视频会议已成为一个非常炎热且充满挑战的问题。我们提出了FAIVCONF,这是一个基于有效的神经人类面部生成技术的特殊设计的视频会议视频压缩框架。FAIVCONF汇集了几种设计,以改善实际视频会议场景中的系统鲁棒性:面部交换,以避免在背景动画中进行工件;面部模糊以减少传输位量并保持提取的面部地标的质量;面部视图插值的动态源更新,以适应各种头部姿势。我们的方法在视频会议上实现了显着的比率降低,与H.264和H.265编码方案相比,在相同的位速率下可提供更好的视觉质量。
translated by 谷歌翻译
重量修剪是一种有效的模型压缩技术,可以解决在移动设备上实现实时深神经网络(DNN)推断的挑战。然而,由于精度劣化,难以利用硬件加速度,以及某些类型的DNN层的限制,难以降低的应用方案具有有限的应用方案。在本文中,我们提出了一般的细粒度的结构化修剪方案和相应的编译器优化,适用于任何类型的DNN层,同时实现高精度和硬件推理性能。随着使用我们的编译器优化所支持的不同层的灵活性,我们进一步探讨了确定最佳修剪方案的新问题,了解各种修剪方案的不同加速度和精度性能。两个修剪方案映射方法,一个是基于搜索,另一个是基于规则的,建议自动推导出任何给定DNN的每层的最佳修剪规则和块大小。实验结果表明,我们的修剪方案映射方法,以及一般细粒化结构修剪方案,优于最先进的DNN优化框架,最高可达2.48 $ \ times $和1.73 $ \ times $ DNN推理加速在CiFar-10和Imagenet DataSet上没有准确性损失。
translated by 谷歌翻译
In this paper, we propose a robust 3D detector, named Cross Modal Transformer (CMT), for end-to-end 3D multi-modal detection. Without explicit view transformation, CMT takes the image and point clouds tokens as inputs and directly outputs accurate 3D bounding boxes. The spatial alignment of multi-modal tokens is performed implicitly, by encoding the 3D points into multi-modal features. The core design of CMT is quite simple while its performance is impressive. CMT obtains 73.0% NDS on nuScenes benchmark. Moreover, CMT has a strong robustness even if the LiDAR is missing. Code will be released at https://github.com/junjie18/CMT.
translated by 谷歌翻译
Dataset distillation has emerged as a prominent technique to improve data efficiency when training machine learning models. It encapsulates the knowledge from a large dataset into a smaller synthetic dataset. A model trained on this smaller distilled dataset can attain comparable performance to a model trained on the original training dataset. However, the existing dataset distillation techniques mainly aim at achieving the best trade-off between resource usage efficiency and model utility. The security risks stemming from them have not been explored. This study performs the first backdoor attack against the models trained on the data distilled by dataset distillation models in the image domain. Concretely, we inject triggers into the synthetic data during the distillation procedure rather than during the model training stage, where all previous attacks are performed. We propose two types of backdoor attacks, namely NAIVEATTACK and DOORPING. NAIVEATTACK simply adds triggers to the raw data at the initial distillation phase, while DOORPING iteratively updates the triggers during the entire distillation procedure. We conduct extensive evaluations on multiple datasets, architectures, and dataset distillation techniques. Empirical evaluation shows that NAIVEATTACK achieves decent attack success rate (ASR) scores in some cases, while DOORPING reaches higher ASR scores (close to 1.0) in all cases. Furthermore, we conduct a comprehensive ablation study to analyze the factors that may affect the attack performance. Finally, we evaluate multiple defense mechanisms against our backdoor attacks and show that our attacks can practically circumvent these defense mechanisms.
translated by 谷歌翻译
Automatic music generation with artificial intelligence typically requires a large amount of data which is hard to obtain for many less common genres and musical instruments. To tackle this issue, we present ongoing work and preliminary findings on the possibility for deep models to transfer knowledge from language to music, by finetuning large language models pre-trained on a massive text corpus on only hundreds of MIDI files of drum performances. We show that by doing so, one of the largest, state-of-the-art models (GPT3) is capable of generating reasonable drum grooves, while models that are not pre-trained (Transformer) shows no such ability beyond naive repetition. Evaluating generated music is a challenging task, more so is evaluating drum grooves with little precedence in literature. Hence, we propose a tailored structural evaluation method and analyze drum grooves produced by GPT3 compared to those played by human professionals, exposing the strengths and weaknesses of such generation by language-to-music transfer. Our findings suggest that language-to-music transfer learning with large language models is viable and promising.
translated by 谷歌翻译
Few Shot Instance Segmentation (FSIS) requires models to detect and segment novel classes with limited several support examples. In this work, we explore a simple yet unified solution for FSIS as well as its incremental variants, and introduce a new framework named Reference Twice (RefT) to fully explore the relationship between support/query features based on a Transformer-like framework. Our key insights are two folds: Firstly, with the aid of support masks, we can generate dynamic class centers more appropriately to re-weight query features. Secondly, we find that support object queries have already encoded key factors after base training. In this way, the query features can be enhanced twice from two aspects, i.e., feature-level and instance-level. In particular, we firstly design a mask-based dynamic weighting module to enhance support features and then propose to link object queries for better calibration via cross-attention. After the above steps, the novel classes can be improved significantly over our strong baseline. Additionally, our new framework can be easily extended to incremental FSIS with minor modification. When benchmarking results on the COCO dataset for FSIS, gFSIS, and iFSIS settings, our method achieves a competitive performance compared to existing approaches across different shots, e.g., we boost nAP by noticeable +8.2/+9.4 over the current state-of-the-art FSIS method for 10/30-shot. We further demonstrate the superiority of our approach on Few Shot Object Detection. Code and model will be available.
translated by 谷歌翻译
Graph Neural Networks (GNNs) have shown satisfying performance on various graph learning tasks. To achieve better fitting capability, most GNNs are with a large number of parameters, which makes these GNNs computationally expensive. Therefore, it is difficult to deploy them onto edge devices with scarce computational resources, e.g., mobile phones and wearable smart devices. Knowledge Distillation (KD) is a common solution to compress GNNs, where a light-weighted model (i.e., the student model) is encouraged to mimic the behavior of a computationally expensive GNN (i.e., the teacher GNN model). Nevertheless, most existing GNN-based KD methods lack fairness consideration. As a consequence, the student model usually inherits and even exaggerates the bias from the teacher GNN. To handle such a problem, we take initial steps towards fair knowledge distillation for GNNs. Specifically, we first formulate a novel problem of fair knowledge distillation for GNN-based teacher-student frameworks. Then we propose a principled framework named RELIANT to mitigate the bias exhibited by the student model. Notably, the design of RELIANT is decoupled from any specific teacher and student model structures, and thus can be easily adapted to various GNN-based KD frameworks. We perform extensive experiments on multiple real-world datasets, which corroborates that RELIANT achieves less biased GNN knowledge distillation while maintaining high prediction utility.
translated by 谷歌翻译
This paper focuses on designing efficient models with low parameters and FLOPs for dense predictions. Even though CNN-based lightweight methods have achieved stunning results after years of research, trading-off model accuracy and constrained resources still need further improvements. This work rethinks the essential unity of efficient Inverted Residual Block in MobileNetv2 and effective Transformer in ViT, inductively abstracting a general concept of Meta-Mobile Block, and we argue that the specific instantiation is very important to model performance though sharing the same framework. Motivated by this phenomenon, we deduce a simple yet efficient modern \textbf{I}nverted \textbf{R}esidual \textbf{M}obile \textbf{B}lock (iRMB) for mobile applications, which absorbs CNN-like efficiency to model short-distance dependency and Transformer-like dynamic modeling capability to learn long-distance interactions. Furthermore, we design a ResNet-like 4-phase \textbf{E}fficient \textbf{MO}del (EMO) based only on a series of iRMBs for dense applications. Massive experiments on ImageNet-1K, COCO2017, and ADE20K benchmarks demonstrate the superiority of our EMO over state-of-the-art methods, \eg, our EMO-1M/2M/5M achieve 71.5, 75.1, and 78.4 Top-1 that surpass \textbf{SoTA} CNN-/Transformer-based models, while trading-off the model accuracy and efficiency well.
translated by 谷歌翻译