Recent CLIP-guided 3D optimization methods, e.g., DreamFields and PureCLIPNeRF, achieve great success in zero-shot text-guided 3D synthesis. However, because they are trained from scratch with random initialization and no prior knowledge, these methods usually fail to generate accurate and faithful 3D structures that conform to the corresponding text. In this paper, we make the first attempt to introduce an explicit 3D shape prior into CLIP-guided 3D optimization methods. Specifically, we first generate a high-quality 3D shape from the input text in the text-to-shape stage to serve as the 3D shape prior. We then use it to initialize a neural radiance field, which we optimize with the full prompt. For text-to-shape generation, we present a simple yet effective approach that directly bridges the text and image modalities with a powerful text-to-image diffusion model. To narrow the style domain gap between images synthesized by the text-to-image model and the shape renderings used to train the image-to-shape generator, we further propose to jointly optimize a learnable text prompt and fine-tune the text-to-image diffusion model for rendering-style image generation. Our method, named Dream3D, is capable of generating imaginative 3D content with better visual quality and shape accuracy than state-of-the-art methods.
Translated by Google Translate
MetaFormer, the abstracted architecture of Transformer, has been found to play a significant role in achieving competitive performance. In this paper, we further explore the capacity of MetaFormer, again without focusing on token mixer design: we introduce several baseline models under MetaFormer using the most basic or common mixers, and summarize our observations as follows: (1) MetaFormer ensures a solid lower bound of performance. By merely adopting identity mapping as the token mixer, the MetaFormer model, termed IdentityFormer, achieves >80% accuracy on ImageNet-1K. (2) MetaFormer works well with arbitrary token mixers. Even when the token mixer is specified as a random matrix, the resulting model, RandFormer, yields an accuracy of >81%, outperforming IdentityFormer. Rest assured of MetaFormer's results when new token mixers are adopted. (3) MetaFormer effortlessly offers state-of-the-art results. With just conventional token mixers dating back five years, the models instantiated from MetaFormer already beat the state of the art. (a) ConvFormer outperforms ConvNeXt. Taking the common depthwise separable convolutions as the token mixer, the model termed ConvFormer, which can be regarded as a pure CNN, outperforms the strong CNN model ConvNeXt. (b) CAFormer sets a new record on ImageNet-1K. By simply applying depthwise separable convolutions as the token mixer in the bottom stages and vanilla self-attention in the top stages, the resulting model, CAFormer, sets a new record on ImageNet-1K: it achieves an accuracy of 85.5% at 224x224 resolution, under normal supervised training without external data or distillation. In our expedition to probe MetaFormer, we also find that a new activation, StarReLU, reduces the FLOPs of the activation by 71% compared with GELU yet achieves better performance. We expect StarReLU to find great potential in MetaFormer-like models alongside other neural networks.
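The idea of keeping the block fixed and swapping only the token mixer can be illustrated with a small pure-Python sketch. It is not the authors' implementation: the helper names (`layer_norm`, `make_random_mixer`, `toy_channel_mlp`, `metaformer_block`), the row normalization of the random matrix, the weight-free channel MLP, and the default `scale`/`bias` values are all assumptions made for illustration. The residual layout and the StarReLU form, scale * ReLU(x)^2 + bias, follow the usual description of these models.

```python
import random


def layer_norm(vec, eps=1e-6):
    """Normalize one token's channels to zero mean and unit variance."""
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [(v - mean) / (var + eps) ** 0.5 for v in vec]


def identity_mixer(tokens):
    """IdentityFormer-style mixer: no information moves between tokens."""
    return [list(t) for t in tokens]


def make_random_mixer(num_tokens, seed=0):
    """RandFormer-style mixer: a fixed random matrix mixes the tokens.
    Row normalization is an assumption made here to keep the scale stable."""
    rng = random.Random(seed)
    weights = []
    for _ in range(num_tokens):
        row = [rng.random() for _ in range(num_tokens)]
        total = sum(row)
        weights.append([w / total for w in row])

    def mixer(tokens):
        dim = len(tokens[0])
        return [
            [sum(weights[i][j] * tokens[j][d] for j in range(num_tokens))
             for d in range(dim)]
            for i in range(num_tokens)
        ]

    return mixer


def star_relu(x, scale=1.0, bias=0.0):
    """StarReLU: scale * relu(x)**2 + bias. Scale and bias are learnable
    in practice; the defaults here are placeholders."""
    return scale * max(x, 0.0) ** 2 + bias


def toy_channel_mlp(token):
    """Hypothetical stand-in for the channel MLP: activation only, no weights."""
    return [star_relu(v) for v in token]


def metaformer_block(tokens, token_mixer, channel_mlp=toy_channel_mlp):
    """x = x + mixer(norm(x)), then x = x + mlp(norm(x))."""
    mixed = token_mixer([layer_norm(t) for t in tokens])
    tokens = [[a + b for a, b in zip(t, m)] for t, m in zip(tokens, mixed)]
    out = [channel_mlp(layer_norm(t)) for t in tokens]
    return [[a + b for a, b in zip(t, o)] for t, o in zip(tokens, out)]


# Four tokens with three channels each (made-up values).
tokens = [[1.0, 2.0, 3.0], [0.5, -1.0, 2.5], [4.0, 0.0, -2.0], [1.5, 1.5, 0.0]]
identity_out = metaformer_block(tokens, identity_mixer)
random_out = metaformer_block(tokens, make_random_mixer(len(tokens)))
print(len(identity_out), len(identity_out[0]))
```

Only the `token_mixer` argument changes between the two calls, which is the sense in which the surrounding architecture, rather than the mixer, is held constant.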
Recent advances in neural approaches have greatly improved task-oriented dialogue (TOD) systems, which assist users in accomplishing their goals. However, such systems rely on costly manually labeled dialogs, which are not available in practical scenarios. In this paper, we present our models for Track 2 of the SereTOD 2022 challenge, the first challenge on building semi-supervised and reinforced TOD systems on MobileCS, a large-scale real-world Chinese TOD dataset. We build a knowledge-grounded dialog model that takes the dialog history and local KB as input and predicts the system response. We also perform semi-supervised pre-training on both the labeled and unlabeled data. Our system achieves first place in both the automatic evaluation and human interaction, with notably higher BLEU (+7.64) and Success (+13.6%) than the second-place system.
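Panoptic segmentation combines the advantages of semantic and instance segmentation and can provide both pixel-level and instance-level environmental perception information for intelligent vehicles. However, it is challenged by objects of various scales, especially extremely large and small ones. In this work, we propose two lightweight modules to mitigate this problem. First, a Pixel-relation Block is designed to model global context information for large-scale things; it is based on a query-independent formulation and adds only a small number of parameters. Then, a Convectional Network is constructed to collect extra high-resolution information for small-scale stuff, supplying more appropriate semantic features for the downstream segmentation branches. Based on these two modules, we present an end-to-end Scale-aware Unified Network (SUNet), which is better suited to multi-scale objects. Extensive experiments on Cityscapes and COCO demonstrate the effectiveness of the proposed methods.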
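Anomaly detection in surveillance videos is challenging and important for ensuring public security. Unlike pixel-based anomaly detection methods, pose-based methods use highly structured skeleton data, which reduces the computational burden and avoids the negative impact of background noise. However, whereas pixel-based methods can directly exploit explicit motion features such as optical flow, pose-based methods lack an alternative dynamic representation. In this paper, a novel Motion Embedder (ME) is proposed to provide a pose motion representation from a probabilistic perspective. Furthermore, a novel task-specific Spatial-Temporal Transformer (STT) is deployed for self-supervised pose sequence reconstruction. These two modules are then integrated into a unified framework for pose regularity learning, referred to as the Motion Prior Regularity Learner (MoPRL). MoPRL achieves state-of-the-art performance, with an average improvement of 4.7% AUC on several challenging datasets. Extensive experiments validate the versatility of each proposed module.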
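Transformers have shown great potential in computer vision tasks. A common belief is that their attention-based token mixer module contributes most to their competence. However, recent works show that the attention-based module in transformers can be replaced by spatial MLPs and the resulting models still perform quite well. Based on this observation, we hypothesize that the general architecture of transformers, rather than the specific token mixer module, is more essential to the model's performance. To verify this, we deliberately replace the attention module in transformers with an embarrassingly simple spatial pooling operator that conducts only the most basic token mixing. Surprisingly, we observe that the derived model, termed PoolFormer, achieves competitive performance on multiple computer vision tasks. For example, on ImageNet-1K, PoolFormer achieves 82.1% top-1 accuracy, surpassing the well-tuned vision transformer/MLP-like baselines DeiT-B/ResMLP-B24 by 0.3%/1.1% accuracy with 35%/52% fewer parameters and 48%/60% fewer MACs. The effectiveness of PoolFormer verifies our hypothesis and urges us to initiate the concept of "MetaFormer", a general architecture abstracted from transformers without specifying the token mixer. Based on extensive experiments, we argue that MetaFormer is the key player in achieving superior results for recent transformer and MLP-like models on vision tasks. This work calls for more future research dedicated to improving MetaFormer instead of focusing on token mixer modules. Additionally, our proposed PoolFormer could serve as a starting baseline for future MetaFormer architecture design. Code is available at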
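To make the "simple spatial pooling" token mixer concrete, here is a minimal pure-Python sketch of average pooling used as a token mixer. The 1D token layout, the window size, the edge handling, and the function name `pool_mixer` are assumptions chosen for brevity; the model described above operates on 2D feature maps, and this is not the authors' code.

```python
def pool_mixer(tokens, window=3):
    """Replace each token by the average of its neighbors inside an odd-sized
    window. Windows are truncated at the sequence edges, so padding positions
    are not counted. The mixer has no learnable parameters."""
    half = window // 2
    num_tokens = len(tokens)
    dim = len(tokens[0])
    mixed = []
    for i in range(num_tokens):
        lo, hi = max(0, i - half), min(num_tokens, i + half + 1)
        neighbors = tokens[lo:hi]
        mixed.append(
            [sum(t[d] for t in neighbors) / len(neighbors) for d in range(dim)]
        )
    return mixed


# Three one-channel tokens (made-up values).
print(pool_mixer([[0.0], [3.0], [6.0]]))  # → [[1.5], [3.0], [4.5]]
```

Because the operator only averages nearby tokens, any accuracy a model achieves with it has to come from the rest of the block, which is the point the abstract makes.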