The evaluation of abstractive summarization models typically uses test data drawn from the same distribution as the training data. In real-world practice, documents to be summarized may contain input noise caused by text extraction artifacts or data pipeline bugs. The robustness of model performance under the distribution shift caused by such noise is relatively under-studied. We present a large empirical study quantifying the sometimes severe loss in performance (up to 12 ROUGE-1 points) from different types of input noise across a range of datasets and model sizes. We then propose a lightweight method for detecting and removing such noise in the input during model inference without requiring any extra training, auxiliary models, or even prior knowledge of the type of noise. Our proposed approach effectively mitigates the loss in performance, recovering a large fraction of the drop, sometimes as much as 11 ROUGE-1 points.
translated by 谷歌翻译
Bird's-Eye-View (BEV) 3D object detection is a crucial multi-view technique for autonomous driving systems. Many recent works follow a similar paradigm consisting of three essential components: camera feature extraction, BEV feature construction, and task heads. Of the three, BEV feature construction is the one specific to BEV, with no counterpart in 2D tasks. Existing methods aggregate the multi-view camera features onto a flattened grid to construct the BEV feature. However, flattening the BEV space along the height dimension fails to emphasize the informative features found at different heights. For example, a barrier sits at a low height while a truck extends to a greater height. In this paper, we propose a novel method named BEV Slice Attention Network (BEV-SAN) for exploiting the intrinsic characteristics of different heights. Instead of flattening the BEV space, we first sample along the height dimension to build global and local BEV slices. Then, the features of the BEV slices are aggregated from the camera features and merged by an attention mechanism. Finally, we fuse the merged local and global BEV features with a transformer to generate the final feature map for the task heads. The purpose of the local BEV slices is to emphasize informative heights. To find them, we further propose a LiDAR-guided sampling strategy that leverages the statistical distribution of LiDAR points to determine the heights of the local slices. Compared with uniform sampling, LiDAR-guided sampling can determine more informative heights. We conduct detailed experiments to demonstrate the effectiveness of BEV-SAN. Code will be released.
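The contrast between uniform and LiDAR-guided height sampling can be illustrated with a minimal sketch. The abstract does not give the exact rule, so this assumes one plausible reading: slice boundaries are placed at quantiles of the LiDAR height values, so that each local slice covers roughly the same number of points. The function names `uniform_slice_bounds` and `quantile_slice_bounds` are hypothetical and are not taken from BEV-SAN.

```python
def uniform_slice_bounds(z_min, z_max, num_slices):
    """Split [z_min, z_max] into equally wide height slices."""
    if num_slices < 1:
        raise ValueError("num_slices must be at least 1")
    step = (z_max - z_min) / num_slices
    return [(z_min + i * step, z_min + (i + 1) * step) for i in range(num_slices)]


def quantile_slice_bounds(heights, num_slices):
    """Assumed LiDAR-guided rule: boundaries at height quantiles, so each
    slice holds roughly the same number of LiDAR points."""
    if num_slices < 1 or not heights:
        raise ValueError("need at least one slice and one point")
    ordered = sorted(heights)
    n = len(ordered)
    bounds = [ordered[0]]
    for k in range(1, num_slices):
        # Height value below which about k/num_slices of the points fall.
        bounds.append(ordered[(k * n) // num_slices])
    bounds.append(ordered[-1])
    return list(zip(bounds[:-1], bounds[1:]))


# Toy height values (metres): most points lie close to the ground.
lidar_z = [-2.0, -1.9, -1.8, -1.7, -1.6, -1.5, 0.0, 1.0, 2.0, 3.0]
uniform = uniform_slice_bounds(-2.0, 3.0, 2)
guided = quantile_slice_bounds(lidar_z, 2)
```

With this toy data the quantile rule produces a thin slice near the ground, where the points concentrate, and one wide slice above it, whereas uniform sampling splits the range in half regardless of where the points are.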
When facing changing environments in the real world, lightweight models on client devices suffer severe performance drops under distribution shift. The main limitations of existing device models are (1) the inability to update because of the device's computation limit and (2) the limited generalization ability of a lightweight model. Meanwhile, recent large models have shown strong generalization capability on the cloud, but they cannot be deployed on client devices because of the devices' limited computation. To enable the device model to deal with changing environments, we propose a new learning paradigm, Cloud-Device Collaborative Continual Adaptation, which encourages collaboration between cloud and device and improves the generalization of the device model. Based on this paradigm, we further propose an Uncertainty-based Visual Prompt Adapted (U-VPA) teacher-student model to transfer the generalization capability of the large cloud model to the device model. Specifically, we first design Uncertainty Guided Sampling (UGS) to continuously select challenging data and transmit the most out-of-distribution samples from the device to the cloud. Then we propose a Visual Prompt Learning Strategy with Uncertainty guided updating (VPLU) to deal specifically with the selected samples that exhibit larger distribution shifts. We transmit the visual prompts to the device and concatenate them with the incoming data to pull the device testing distribution closer to the cloud training distribution. We conduct extensive experiments on two object detection datasets with continually changing environments. Our proposed U-VPA teacher-student framework outperforms previous state-of-the-art test-time adaptation and device-cloud collaboration methods. The code and datasets will be released.
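A minimal sketch of the sample-selection idea behind uncertainty-guided sampling follows. The abstract does not state which uncertainty measure is used, so this assumes predictive entropy over class probabilities and a fixed upload budget. The names `predictive_entropy` and `select_for_upload` are hypothetical.

```python
import math


def predictive_entropy(probs):
    """Shannon entropy of one predicted class distribution (in nats)."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def select_for_upload(batch_probs, budget):
    """Return indices of the `budget` most uncertain samples, in ascending order.

    Assumption: high entropy is used as a proxy for 'out-of-distribution'.
    """
    ranked = sorted(
        range(len(batch_probs)),
        key=lambda i: predictive_entropy(batch_probs[i]),
        reverse=True,
    )
    return sorted(ranked[:budget])


# Device-side predictions for four incoming samples (toy values).
batch = [
    [0.98, 0.01, 0.01],  # confident
    [0.40, 0.30, 0.30],  # uncertain
    [0.70, 0.20, 0.10],  # fairly confident
    [0.34, 0.33, 0.33],  # most uncertain
]
to_cloud = select_for_upload(batch, budget=2)
```

Only the selected indices would be sent to the cloud, which keeps the uplink traffic bounded by the budget.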
Recently, Bird's-Eye-View (BEV) representation has gained increasing attention in multi-view 3D object detection and has demonstrated promising applications in autonomous driving. Although multi-view camera systems can be deployed at low cost, the lack of depth information leads current approaches to adopt large models for good performance. It is therefore essential to improve the efficiency of BEV 3D object detection. Knowledge Distillation (KD) is one of the most practical techniques for training efficient yet accurate models. However, to the best of our knowledge, BEV KD is still under-explored. Unlike image classification, BEV 3D object detection approaches are more complicated and consist of several components. In this paper, we propose a unified framework named BEV-LGKD to transfer knowledge in a teacher-student manner. However, directly applying the teacher-student paradigm to BEV features fails to achieve satisfactory results because of the heavy background information in RGB camera images. To solve this problem, we propose to leverage the localization advantage of LiDAR points. Specifically, we transform the LiDAR points to BEV space and generate a foreground mask and a view-dependent mask for the teacher-student paradigm. Note that our method uses LiDAR points only to guide the KD between RGB models. As the quality of depth estimation is crucial for BEV perception, we further introduce depth distillation into our framework. Our unified framework is simple yet effective and achieves a significant performance boost. Code will be released.
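The role of a foreground mask in feature distillation can be sketched as a masked mean-squared error between teacher and student BEV features. This is an illustrative assumption about the form of the loss, not the loss defined by BEV-LGKD. The function name `masked_distill_loss` and the toy grids are hypothetical.

```python
def masked_distill_loss(student_feat, teacher_feat, fg_mask):
    """Mean squared error over BEV cells where fg_mask is truthy.

    All three arguments are 2D grids (lists of rows) of equal shape.
    Background cells are ignored so they cannot dominate the loss.
    """
    total = 0.0
    count = 0
    for s_row, t_row, m_row in zip(student_feat, teacher_feat, fg_mask):
        for s, t, m in zip(s_row, t_row, m_row):
            if m:
                total += (s - t) ** 2
                count += 1
    return total / count if count else 0.0


# Toy 2x2 BEV grids; the mask marks cells that contain LiDAR points.
student = [[1.0, 5.0], [2.0, 0.0]]
teacher = [[0.0, 0.0], [4.0, 0.0]]
mask = [[1, 0], [1, 0]]
loss = masked_distill_loss(student, teacher, mask)
```

In this toy case the large mismatch in the top-right cell is excluded because the mask labels it background.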
Vision-centric Bird's-Eye-View (BEV) perception has shown promising potential and attracted increasing attention in autonomous driving. Recent works mainly focus on improving efficiency or accuracy but neglect the domain shift problem, resulting in severe degradation of transfer performance. Through extensive observation, we identify significant domain gaps in scene, weather, and day-night changing scenarios and make the first attempt to solve the domain adaptation problem for multi-view 3D object detection. Since BEV perception approaches are usually complicated and contain several components, the accumulation of domain shift across multiple latent spaces makes BEV domain adaptation challenging. In this paper, we propose a novel Multi-level Multi-space Alignment Teacher-Student ($M^{2}ATS$) framework to ease the domain shift accumulation, which consists of a Depth-Aware Teacher (DAT) and a Multi-space Feature Aligned (MFA) student model. Specifically, the DAT model adopts uncertainty guidance to sample reliable depth information in the target domain. After constructing domain-invariant BEV perception, it transfers pixel-level and instance-level knowledge to the student model. To further alleviate the domain shift at the global level, the MFA student model is introduced to align task-relevant multi-space features of the two domains. To verify the effectiveness of $M^{2}ATS$, we conduct BEV 3D object detection experiments on four cross-domain scenarios and achieve state-of-the-art performance (e.g., +12.6% NDS and +9.1% mAP on Day-Night). Code and dataset will be released.
Large-scale diffusion-based generative models have led to breakthroughs in text-conditioned high-resolution image synthesis. Starting from random noise, such text-to-image diffusion models gradually synthesize images in an iterative fashion while conditioning on text prompts. We find that their synthesis behavior qualitatively changes throughout this process: early in sampling, generation strongly relies on the text prompt to generate text-aligned content, while later, the text conditioning is almost entirely ignored. This suggests that sharing model parameters throughout the entire generation process may not be ideal. Therefore, in contrast to existing works, we propose to train an ensemble of text-to-image diffusion models specialized for different synthesis stages. To maintain training efficiency, we initially train a single model, which is then split into specialized models that are trained for specific stages of the iterative generation process. Our ensemble of diffusion models, called eDiff-I, results in improved text alignment while maintaining the same inference computation cost and preserving high visual quality, outperforming previous large-scale text-to-image diffusion models on the standard benchmark. In addition, we train our model to exploit a variety of embeddings for conditioning, including the T5 text, CLIP text, and CLIP image embeddings. We show that these different embeddings lead to different behaviors. Notably, the CLIP image embedding allows an intuitive way of transferring the style of a reference image to the target text-to-image output. Lastly, we show a technique that enables eDiff-I's "paint-with-words" capability. A user can select a word in the input text and paint it on a canvas to control the output, which is handy for crafting the image the user has in mind. The project page is available at https://deepimagination.cc/eDiff-I/
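The claim that an ensemble of stage-specialized denoisers keeps the inference cost unchanged follows from the fact that exactly one expert runs per sampling step. A minimal routing sketch is shown below. The class `StageEnsemble`, the noise-level boundaries, and the dummy experts are hypothetical illustrations and are not eDiff-I's actual configuration.

```python
from bisect import bisect_right


class StageEnsemble:
    """Route each denoising step to the expert that owns its noise-level interval."""

    def __init__(self, boundaries, experts):
        # boundaries: ascending cut points in noise level; they split the
        # range into len(boundaries) + 1 intervals, one expert per interval.
        if list(boundaries) != sorted(boundaries):
            raise ValueError("boundaries must be ascending")
        if len(experts) != len(boundaries) + 1:
            raise ValueError("need exactly one expert per interval")
        self.boundaries = list(boundaries)
        self.experts = list(experts)

    def expert_for(self, noise_level):
        return self.experts[bisect_right(self.boundaries, noise_level)]

    def denoise(self, x, noise_level, prompt):
        # Only one expert is evaluated per step, so the per-step cost
        # matches that of a single shared model.
        return self.expert_for(noise_level)(x, noise_level, prompt)


# Dummy experts that only report which stage handled the call.
def make_expert(name):
    return lambda x, noise_level, prompt: name


ensemble = StageEnsemble(
    boundaries=[0.3, 0.7],
    experts=[make_expert("low-noise"), make_expert("mid-noise"), make_expert("high-noise")],
)
stages = [ensemble.denoise(None, t, "a red bicycle") for t in (0.9, 0.5, 0.1)]
```

Sampling proceeds from high to low noise, so the high-noise expert, where text conditioning matters most, is used first.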
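Given its wide range of applications, numerous attempts have been made at the task of face swapping. Although existing methods mostly rely on tedious network and loss designs, they still struggle to balance information between the source and target faces and tend to produce visible artifacts. In this work, we introduce a concise and effective framework named StyleSwap. Our core idea is to leverage a style-based generator to enable high-fidelity and robust face swapping, so that the generator's advantages can be adopted for optimizing identity similarity. We identify that, with only minimal modifications, a StyleGAN2 architecture can successfully handle the desired information from both source and target. Additionally, inspired by the ToRGB layers, a Swapping-Driven Mask Branch is devised to improve information blending. Furthermore, the advantages of StyleGAN inversion can be adopted. In particular, a Swapping-Guided ID Inversion strategy is proposed to optimize identity similarity. Extensive experiments validate that our framework generates high-quality face swapping results that outperform state-of-the-art methods both qualitatively and quantitatively.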
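Three-dimensional fluorescence microscopy often suffers from anisotropy, where the resolution along the axial direction is lower than that within the lateral imaging plane. We address this issue by presenting Dual-Cycle, a new framework for joint deconvolution and fusion of dual-view fluorescence images. Inspired by the recent Neuroclear method, Dual-Cycle is designed as a cycle-consistent generative network trained in a self-supervised fashion by combining a dual-view generator and a prior-guided degradation model. We validate Dual-Cycle on both synthetic and real data, showing its state-of-the-art performance without any external training data.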
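Depth estimation is essential for various important real-world applications such as autonomous driving. However, it suffers severe performance degradation in high-speed scenarios, because traditional cameras can only capture blurred images. To deal with this problem, the spike camera is designed to capture pixel-wise luminance intensity at a high frame rate. However, depth estimation with a spike camera remains very challenging using traditional monocular or stereo depth estimation algorithms, which are based on photometric consistency. In this paper, we propose a novel Uncertainty-Guided Depth Fusion (UGDF) framework to fuse the predictions of monocular and stereo depth estimation networks for the spike camera. Our framework is motivated by the fact that stereo spike depth estimation achieves better results at close range, while monocular spike depth estimation obtains better results at long range. Therefore, we introduce a dual-task depth estimation architecture with a joint training strategy and estimate the distributed uncertainty to fuse the monocular and stereo results. To demonstrate the advantage of spike depth estimation over traditional camera depth estimation, we contribute a spike-depth dataset named CitySpike20k, which contains 20k paired samples for spike depth estimation. UGDF achieves state-of-the-art results on CitySpike20k, surpassing all monocular and stereo spike depth estimation baselines. We conduct extensive experiments to evaluate the effectiveness and generalization of our method on CitySpike20k. To the best of our knowledge, our framework is the first dual-task fusion framework for spike camera depth estimation. Code and dataset will be released.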
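Fusing monocular and stereo depth predictions by their estimated uncertainty can be sketched with per-pixel inverse-uncertainty weighting. This is an assumed fusion rule chosen for illustration. The abstract does not specify how UGDF combines the two branches, and the name `fuse_depth` is hypothetical.

```python
def fuse_depth(mono_depth, mono_unc, stereo_depth, stereo_unc, eps=1e-6):
    """Per-pixel weighted average of two depth maps.

    Each prediction is weighted by the inverse of its uncertainty, so the
    more confident branch dominates at every pixel. Inputs are 2D grids
    (lists of rows) of equal shape; uncertainties must be non-negative.
    """
    fused = []
    for md_row, mu_row, sd_row, su_row in zip(mono_depth, mono_unc, stereo_depth, stereo_unc):
        row = []
        for md, mu, sd, su in zip(md_row, mu_row, sd_row, su_row):
            w_mono = 1.0 / (mu + eps)
            w_stereo = 1.0 / (su + eps)
            row.append((w_mono * md + w_stereo * sd) / (w_mono + w_stereo))
        fused.append(row)
    return fused


# One near pixel (stereo is more certain) and one far pixel (mono is more certain).
mono = [[10.0, 50.0]]
mono_u = [[4.0, 1.0]]
stereo = [[8.0, 40.0]]
stereo_u = [[1.0, 4.0]]
fused = fuse_depth(mono, mono_u, stereo, stereo_u)
```

In this toy case the fused value stays close to the stereo estimate at the near pixel and close to the monocular estimate at the far pixel, mirroring the close-range and long-range strengths described above.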
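Neuromorphic spike cameras generate data streams with high temporal resolution in a bio-inspired way, which has vast potential in real-world applications such as autonomous driving. In contrast to RGB streams, spike streams have an inherent advantage in overcoming motion blur, leading to more accurate depth estimation for high-speed objects. However, training a spike depth estimation network in a supervised manner is almost impossible, since obtaining paired depth labels for temporally dense spike streams is extremely laborious and challenging. In this paper, instead of building a spike stream dataset with full depth labels, we transfer knowledge from open-source RGB datasets (e.g., KITTI) and estimate spike depth in an unsupervised manner. The key challenges of this problem lie in the modality gap between the RGB and spike modalities, and the domain gap between the labeled source RGB domain and the unlabeled target spike domain. To overcome these challenges, we introduce a cross-modality cross-domain (BiCross) framework for unsupervised spike depth estimation. Our method narrows the enormous gap between source RGB and target spike by introducing an intermediate simulated source spike domain. Specifically, for the cross-modality phase, we propose a novel Coarse-to-Fine Knowledge Distillation (CFKD), which transfers image-level and pixel-level knowledge from source RGB to source spike. This design leverages the abundant semantic information of the RGB modality and the dense temporal information of the spike modality. For the cross-domain phase, we introduce an Uncertainty-Guided Mean-Teacher (UGMT) to generate reliable pseudo labels with uncertainty estimation, alleviating the shift between the source spike and target spike domains. In addition, we propose a Global-Level Feature Alignment (GLFA) method to align the features of the two domains and generate more reliable pseudo labels.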
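The two generic ingredients of a mean-teacher scheme with uncertainty filtering can be sketched as follows: the teacher's weights track the student's through an exponential moving average, and only pseudo labels with low estimated uncertainty are kept. The threshold rule, the function names `ema_update` and `reliable_pseudo_labels`, and all numeric values are assumptions for illustration, not details taken from the BiCross paper.

```python
def ema_update(teacher, student, momentum=0.99):
    """Return new teacher weights: an exponential moving average of the student.

    `teacher` and `student` map parameter names to float values.
    """
    return {
        name: momentum * teacher[name] + (1.0 - momentum) * student[name]
        for name in teacher
    }


def reliable_pseudo_labels(teacher_depth, teacher_unc, max_unc):
    """Keep (pixel_index, depth) pairs whose uncertainty is at most max_unc."""
    return [
        (i, d)
        for i, (d, u) in enumerate(zip(teacher_depth, teacher_unc))
        if u <= max_unc
    ]


# The teacher drifts slowly toward the student.
teacher_w = ema_update({"w": 1.0}, {"w": 0.0}, momentum=0.9)

# Only confident teacher predictions supervise the student on target data.
labels = reliable_pseudo_labels([5.0, 12.0, 30.0], [0.1, 0.8, 0.2], max_unc=0.5)
```

A high momentum keeps the teacher stable while the student adapts, which is the usual reason for using the teacher, rather than the student, to produce pseudo labels.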