Generic image inpainting aims to complete a corrupted image by borrowing surrounding information, which barely generates novel content. By contrast, multi-modal inpainting provides more flexible and useful controls on the inpainted content, \eg, a text prompt can be used to describe an object with richer attributes, and a mask can be used to constrain the shape of the inpainted object rather than being only considered as a missing area. We propose a new diffusion-based model named SmartBrush for completing a missing region with an object using both text and shape guidance. While previous works such as DALLE-2 and Stable Diffusion can do text-guided inpainting, they do not support shape guidance and tend to modify the background texture surrounding the generated object. Our model incorporates both text and shape guidance with precision control. To better preserve the background, we propose a novel training and sampling strategy that augments the diffusion U-Net with object-mask prediction. Lastly, we introduce a multi-task training strategy that jointly trains inpainting with text-to-image generation to leverage more training data. We conduct extensive experiments showing that our model outperforms all baselines in terms of visual quality, mask controllability, and background preservation.
translated by Google Translate
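The background-preserving sampling idea above can be sketched as a per-step composite: content inside the predicted object mask comes from the generated sample, while the background is copied from the (appropriately noised) original image. This is a minimal illustrative sketch; the function and variable names are hypothetical, not from the paper's code.

```python
import numpy as np

def blend_with_background(x_generated, x_original_noised, object_mask):
    """Composite a denoising step so only the masked object region is edited.

    object_mask: array with 1.0 inside the predicted object, 0.0 outside.
    """
    return object_mask * x_generated + (1.0 - object_mask) * x_original_noised

x_gen = np.full((2, 2), 9.0)     # freshly generated content
x_bg = np.zeros((2, 2))          # noised original background
mask = np.array([[1.0, 0.0],
                 [0.0, 0.0]])    # object occupies only the top-left pixel
out = blend_with_background(x_gen, x_bg, mask)
```

Applied at every denoising step, this keeps the untouched background pixel-consistent with the input while the object region is synthesized.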
Object compositing based on 2D images is a challenging problem since it typically involves multiple processing stages such as color harmonization, geometry correction and shadow generation to generate realistic results. Furthermore, annotating training data pairs for compositing requires substantial manual effort from professionals, and is hardly scalable. Thus, with the recent advances in generative models, in this work, we propose a self-supervised framework for object compositing by leveraging the power of conditional diffusion models. Our framework can holistically address the object compositing task in a unified model, transforming the viewpoint, geometry, color and shadow of the generated object while requiring no manual labeling. To preserve the input object's characteristics, we introduce a content adaptor that helps to maintain categorical semantics and object appearance. A data augmentation method is further adopted to improve the fidelity of the generator. Our method outperforms relevant baselines in both realism and faithfulness of the synthesized result images in a user study on various real-world images.
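The self-supervised setup described above can be sketched as follows: crop the object out of a real image, perturb the crop (here a horizontal flip stands in for stronger augmentations), and use the unperturbed image as the reconstruction target. All names are hypothetical, and this is only a data-pair construction sketch, not the paper's pipeline.

```python
import numpy as np

def make_training_pair(image, box):
    """Return (augmented_object, background_with_hole, target_image)."""
    y0, y1, x0, x1 = box
    obj = image[y0:y1, x0:x1].copy()
    obj_aug = obj[:, ::-1]              # perturb the conditioning crop
    background = image.copy()
    background[y0:y1, x0:x1] = 0.0      # hole where the object was
    return obj_aug, background, image   # target is the original image

img = np.arange(16.0).reshape(4, 4)
obj_aug, bg, target = make_training_pair(img, (1, 3, 1, 3))
```

Because the target is the original image, the generator must learn to harmonize the perturbed object back into the scene without any manual labels.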
LiDAR odometry is one of the key components of LiDAR simultaneous localization and mapping (SLAM). However, existing LiDAR odometry methods tend to match each new scan against previous scans with fixed poses, gradually accumulating error. Furthermore, although bundle adjustment (BA) is an effective joint optimization mechanism, it cannot be directly introduced into real-time odometry because of the heavy computation over large numbers of global landmarks. This letter therefore designs a new strategy, called a landmark map for bundle adjustment odometry (LMBAO) in LiDAR SLAM, to address these problems. First, BA-based odometry is further developed with an active landmark maintenance strategy for more accurate local registration and to avoid accumulated error. Specifically, this paper keeps all stable landmarks on the map, rather than only the feature points in the sliding window, and deletes landmarks according to their active grade. Next, the sliding window length is reduced, and marginalization is performed to retain the scans outside the window that correspond to active landmarks on the map, which greatly simplifies the computation and improves real-time performance. In addition, experiments on three challenging datasets show that our algorithm achieves real-time performance in outdoor driving and outperforms state-of-the-art LiDAR SLAM algorithms, including LeGO-LOAM and VLOM.
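The active landmark maintenance described above can be sketched as a simple grade update: a landmark's active grade rises when it is re-observed, decays otherwise, and the landmark is deleted from the map once its grade falls below a threshold. The decay/boost/threshold values are arbitrary illustrative choices, not the paper's parameters.

```python
def maintain_landmarks(landmarks, observed_ids, decay=0.5, boost=1.0,
                       drop_below=0.2):
    """landmarks: dict id -> active grade. Returns the pruned landmark map."""
    updated = {}
    for lid, grade in landmarks.items():
        # Re-observed landmarks gain grade; stale ones decay toward deletion.
        grade = grade + boost if lid in observed_ids else grade * decay
        if grade >= drop_below:
            updated[lid] = grade
    return updated

lmap = {1: 1.0, 2: 0.3, 3: 0.5}
lmap = maintain_landmarks(lmap, observed_ids={1})
```

Keeping only actively observed landmarks bounds the BA problem size, which is what makes the joint optimization feasible in real time.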
Artistic text recognition is an extremely challenging task with a wide range of applications. However, current scene text recognition methods mainly focus on irregular text and have not specifically explored artistic text. The challenges of artistic text recognition include the varied appearance of specially designed fonts and effects, the complex connections and overlaps between characters, and severe interference from background patterns. To alleviate these problems, we propose to recognize artistic text at three levels. First, corner points are used to guide the extraction of features inside each character, since corner structure is robust to changes in appearance and shape. In this way, the discreteness of corner points cuts off the connections between characters, and their sparsity improves robustness to background interference. Second, we design a character contrastive loss to model character-level features, which improves the feature representation for character classification. Third, we utilize Transformers to learn global features at the image level and to model the global relationships of corner points, with the help of a corner cross-attention mechanism. In addition, we provide an artistic text dataset to benchmark performance. Experimental results verify the significant superiority of the proposed method on artistic text recognition, and it also achieves state-of-the-art performance on several blurred and perspective datasets.
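A character-level contrastive objective like the one mentioned above pulls embeddings of the same character class together and pushes others apart. The following toy sketch uses a generic InfoNCE-style formulation; it is not the paper's exact loss, and the temperature and features are illustrative.

```python
import numpy as np

def char_contrastive_loss(feats, labels, temperature=0.1):
    """InfoNCE-style loss over L2-normalized character features."""
    f = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    sims = f @ f.T / temperature
    n = len(labels)
    losses = []
    for i in range(n):
        positives = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not positives:
            continue
        denom = np.sum([np.exp(sims[i, j]) for j in range(n) if j != i])
        for j in positives:
            losses.append(-np.log(np.exp(sims[i, j]) / denom))
    return float(np.mean(losses))

feats = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
labels = ["a", "a", "b", "b"]
loss = char_contrastive_loss(feats, labels)
```

Minimizing this loss tightens per-character clusters in feature space, which is the stated goal of improving character classification.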
We present Simprov, a scalable image provenance framework that matches a query image back to a trusted database of originals and identifies the likely manipulations applied to the query. Simprov consists of three stages: a scalable search stage that retrieves the top-k most similar images; a re-ranking and near-duplicate detection stage that identifies the original among the candidates; and finally a manipulation detection and visualization stage that localizes the regions of the query that were likely manipulated to differ from the original. Simprov is robust to the benign image transformations that commonly occur during online redistribution, such as artifacts from noise and recompression degradation, as well as out-of-place transformations caused by image padding, warping, and changes in size and shape. Robustness to out-of-place transformations is achieved through end-to-end training of a differentiable warping module within the comparator architecture. We demonstrate effective retrieval and manipulation detection on a dataset of 100 million images.
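The first (retrieval) stage above can be sketched as embedding-based nearest-neighbor search: embed the query and database images and return the indices of the top-k most similar database entries by cosine similarity. The embeddings here are stand-ins for features from a learned comparator network; at the scale of 100 million images an approximate index would replace the brute-force scan.

```python
import numpy as np

def top_k_similar(query_emb, db_embs, k=2):
    """Return indices of the k database embeddings closest to the query."""
    q = query_emb / np.linalg.norm(query_emb)
    d = db_embs / np.linalg.norm(db_embs, axis=1, keepdims=True)
    scores = d @ q                      # cosine similarity per database entry
    return np.argsort(-scores)[:k].tolist()

db = np.array([[1.0, 0.0], [0.7, 0.7], [0.0, 1.0]])
hits = top_k_similar(np.array([0.9, 0.1]), db, k=2)
```

The returned candidates would then feed the re-ranking and near-duplicate detection stage.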
Fonts are ubiquitous in documents and come in a variety of styles. They are represented either in a native vector format or rasterized into fixed-resolution images. In the first case, the non-standard representation prevents benefiting from the latest network architectures for neural representation; in the latter case, the rasterized representation incurs a loss of data fidelity when encoded through a network, since font-specific discontinuities such as edges and corners are difficult to represent with neural networks. Based on the observation that complex fonts can be represented by a superposition of a set of simpler occupancy functions, we introduce \textit{multi-implicits} to represent fonts as a permutation-invariant set of learned implicit functions, without losing features (e.g., edges and corners). However, while multi-implicits locally preserve font features, obtaining supervision in the form of ground-truth multi-channel signals is itself a problem. Instead, we show how to train such a representation with only local supervision, while the proposed neural architecture directly finds globally consistent multi-implicits for font families. We extensively evaluate the proposed representation on a variety of tasks, including reconstruction, interpolation, and synthesis, to demonstrate clear advantages over existing alternatives. In addition, the representation naturally enables glyph completion, where a single characteristic font is used to synthesize a whole font family in the target style.
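The key observation above can be illustrated with a toy example: a complex glyph occupancy can be the superposition (pointwise max) of simpler occupancy functions. Here a plus-shaped glyph is the max of two axis-aligned box occupancies; in the paper each component would be a learned implicit function rather than a hand-written box.

```python
def box_occupancy(x, y, x0, x1, y0, y1):
    """Occupancy of an axis-aligned box: 1.0 inside, 0.0 outside."""
    return 1.0 if (x0 <= x <= x1 and y0 <= y <= y1) else 0.0

def glyph_occupancy(x, y, parts):
    # Permutation-invariant combination: max over the set of parts.
    return max(part(x, y) for part in parts)

parts = [
    lambda x, y: box_occupancy(x, y, -1.0, 1.0, -0.2, 0.2),  # horizontal bar
    lambda x, y: box_occupancy(x, y, -0.2, 0.2, -1.0, 1.0),  # vertical bar
]
```

Because max is order-independent, the set of parts is naturally permutation-invariant, and each part can remain simple (and easy for a network to fit) even when the union is sharp-cornered.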
"If I provide you a face image of mine (without telling you the actual age when I took the picture) and a large collection of face images that I crawled (containing labeled faces of different ages but not necessarily paired), can you show me what I would look like when I am 80 or what I was like when I was 5?" The answer is probably a "No." Most existing face aging works attempt to learn the transformation between age groups and thus would require the paired samples as well as the labeled query image. In this paper, we look at the problem from a generative modeling perspective such that no paired samples are required. In addition, given an unlabeled image, the generative model can directly produce the image with the desired age attribute. We propose a conditional adversarial autoencoder (CAAE) that learns a face manifold, traversing on which smooth age progression and regression can be realized simultaneously. In CAAE, the face is first mapped to a latent vector through a convolutional encoder, and then the vector is projected to the face manifold conditional on age through a deconvolutional generator. The latent vector preserves personalized face features (i.e., personality) and the age condition controls progression vs. regression. Two adversarial networks are imposed on the encoder and generator, respectively, forcing them to generate more photo-realistic faces. Experimental results demonstrate the appealing performance and flexibility of the proposed framework by comparison with the state of the art and ground truth.
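The CAAE forward pass described above can be sketched at the shape level: an encoder maps a face to a latent vector z (the preserved identity), a one-hot age condition is concatenated, and a generator maps the pair back to image space. Linear maps stand in for the convolutional encoder and deconvolutional generator; all dimensions are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
img_dim, z_dim, n_ages = 64, 8, 10
enc_W = rng.standard_normal((z_dim, img_dim))            # stand-in encoder
gen_W = rng.standard_normal((img_dim, z_dim + n_ages))   # stand-in generator

def encode(face):
    return enc_W @ face                      # z preserves identity

def generate(z, age_idx):
    age = np.zeros(n_ages)
    age[age_idx] = 1.0                       # one-hot age condition
    return gen_W @ np.concatenate([z, age])  # condition controls aging

face = rng.standard_normal(img_dim)
z = encode(face)
young, old = generate(z, 0), generate(z, 9)
```

Holding z fixed while sweeping the age index is exactly the manifold traversal that yields smooth progression and regression for one identity.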
Masked image modeling (MIM) performs strongly in pre-training large vision Transformers (ViTs). However, small models that are critical for real-world applications cannot or only marginally benefit from this pre-training approach. In this paper, we explore distillation techniques to transfer the success of large MIM-based pre-trained models to smaller ones. We systematically study different options in the distillation framework, including distilling targets, losses, input, network regularization, sequential distillation, etc., revealing that: 1) Distilling token relations is more effective than CLS token- and feature-based distillation; 2) Using an intermediate layer of the teacher network as the target performs better than using the last layer when the depth of the student mismatches that of the teacher; 3) Weak regularization is preferred; etc. With these findings, we achieve significant fine-tuning accuracy improvements over MIM pre-training from scratch on ImageNet-1K classification, using the ViT-Tiny, ViT-Small, and ViT-Base models, with +4.2%/+2.4%/+1.4% gains, respectively. Our TinyMIM model of base size achieves 52.2 mIoU in ADE20K semantic segmentation, which is +4.1 higher than the MAE baseline. Our TinyMIM model of tiny size achieves 79.6% top-1 accuracy on ImageNet-1K image classification, which sets a new record for small vision models of the same size and computation budget. This strong performance suggests an alternative way of developing small vision Transformer models, that is, by exploring better training methods rather than introducing inductive biases into architectures as in most previous works. Code is available at https://github.com/OliverRensu/TinyMIM.
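Finding 1 above, distilling token relations, can be sketched as follows: instead of matching raw features, the student is trained so its token-to-token relation matrix matches the teacher's. The softmax/KL formulation below is a generic choice to illustrate the idea, not necessarily the paper's exact loss.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def token_relations(tokens):
    """Attention-style (N, N) relation matrix over a sequence of tokens."""
    d = tokens.shape[-1]
    return softmax(tokens @ tokens.T / np.sqrt(d))

def relation_distill_loss(student_tokens, teacher_tokens):
    rs = token_relations(student_tokens)
    rt = token_relations(teacher_tokens)
    # KL(teacher || student), summed over all token pairs.
    return float(np.sum(rt * (np.log(rt) - np.log(rs))))

teacher = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
loss_same = relation_distill_loss(teacher, teacher)
loss_diff = relation_distill_loss(teacher[::-1], teacher)
```

Because the target is a distribution over token pairs rather than an absolute feature vector, the student need not share the teacher's feature dimension, which suits the large-teacher/small-student setting.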
In this paper, we propose a robust 3D detector, named Cross Modal Transformer (CMT), for end-to-end 3D multi-modal detection. Without explicit view transformation, CMT takes the image and point cloud tokens as inputs and directly outputs accurate 3D bounding boxes. The spatial alignment of multi-modal tokens is performed implicitly, by encoding the 3D points into multi-modal features. The core design of CMT is quite simple while its performance is impressive. CMT obtains 73.0% NDS on the nuScenes benchmark. Moreover, CMT has strong robustness even if the LiDAR is missing. Code will be released at https://github.com/junjie18/CMT.
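The implicit alignment idea above can be sketched at the shape level: image tokens and point-cloud tokens are placed in one sequence, with each modality's positional encoding derived from 3D points, so a plain transformer can attend across modalities without an explicit view transformation. The linear positional encoding and all dimensions are illustrative stand-ins, not the paper's architecture.

```python
import numpy as np

rng = np.random.default_rng(1)
d_model = 16
img_tokens = rng.standard_normal((100, d_model))  # flattened image features
pts_tokens = rng.standard_normal((50, d_model))   # point/voxel features
pos_W = rng.standard_normal((d_model, 3))

def encode_3d_positions(points_xyz):
    # Shared 3D positional encoding: the implicit alignment signal.
    return points_xyz @ pos_W.T

img_ref_pts = rng.uniform(size=(100, 3))  # 3D points associated with pixels
pts_ref_pts = rng.uniform(size=(50, 3))   # 3D point/voxel centers
tokens = np.concatenate(
    [img_tokens + encode_3d_positions(img_ref_pts),
     pts_tokens + encode_3d_positions(pts_ref_pts)], axis=0)
```

Because both modalities share one positional-encoding function over 3D space, tokens that describe the same region land near each other in encoding space, and a standard transformer over `tokens` can fuse them.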
Dataset distillation has emerged as a prominent technique to improve data efficiency when training machine learning models. It encapsulates the knowledge from a large dataset into a smaller synthetic dataset. A model trained on this smaller distilled dataset can attain comparable performance to a model trained on the original training dataset. However, the existing dataset distillation techniques mainly aim at achieving the best trade-off between resource usage efficiency and model utility. The security risks stemming from them have not been explored. This study performs the first backdoor attack against the models trained on the data distilled by dataset distillation models in the image domain. Concretely, we inject triggers into the synthetic data during the distillation procedure rather than during the model training stage, where all previous attacks are performed. We propose two types of backdoor attacks, namely NAIVEATTACK and DOORPING. NAIVEATTACK simply adds triggers to the raw data at the initial distillation phase, while DOORPING iteratively updates the triggers during the entire distillation procedure. We conduct extensive evaluations on multiple datasets, architectures, and dataset distillation techniques. Empirical evaluation shows that NAIVEATTACK achieves decent attack success rate (ASR) scores in some cases, while DOORPING reaches higher ASR scores (close to 1.0) in all cases. Furthermore, we conduct a comprehensive ablation study to analyze the factors that may affect the attack performance. Finally, we evaluate multiple defense mechanisms against our backdoor attacks and show that our attacks can practically circumvent these defense mechanisms.
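The NAIVEATTACK variant described above can be sketched as follows: stamp a trigger patch onto a fraction of the raw images, and relabel them to the attacker's target class, before distillation begins, so the backdoor is baked into the synthetic dataset. Patch size, position, and poison rate below are arbitrary illustrative choices.

```python
import numpy as np

def poison(images, labels, target_class, poison_rate=0.5, patch=2):
    """Inject a corner trigger and target label into a fraction of the data."""
    images, labels = images.copy(), labels.copy()
    n_poison = int(len(images) * poison_rate)
    for i in range(n_poison):
        images[i, -patch:, -patch:] = 1.0   # white trigger in the corner
        labels[i] = target_class            # attacker-chosen label
    return images, labels

imgs = np.zeros((4, 8, 8))
lbls = np.array([0, 1, 2, 3])
p_imgs, p_lbls = poison(imgs, lbls, target_class=3)
```

DOORPING differs in that the trigger is not fixed up front but iteratively updated throughout the distillation procedure, which is what pushes its attack success rate close to 1.0.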