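Contrastive pretraining techniques for text classification have largely been studied in an unsupervised setting. However, labeled data from related tasks that share label semantics with the current task is often available. We hypothesize that using this labeled data effectively can lead to better generalization on the current task. In this paper, we propose a novel way to effectively utilize labeled data from related tasks with a graph-based supervised contrastive learning approach. We formulate a token-graph by extrapolating the supervised information from examples to tokens. Our formulation results in an embedding space where tokens with a high/low probability of belonging to the same class are near to/further away from one another. We also develop detailed theoretical insights that serve as motivation for our method. In our experiments, we show that our method outperforms pretraining schemes by $2.5\%$ and an example-level contrastive learning formulation by $1.8\%$. In addition, we show the cross-domain effectiveness of our method in a zero-shot setting, with an average gain of $3.91\%$. Lastly, we demonstrate that our method can be used as a noisy teacher in a knowledge distillation setting to significantly improve the performance of transformer-based models in the low labeled data regime, by $4.57\%$ on average.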
translated by 谷歌翻译
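The token-graph idea in the contrastive pretraining abstract above can be illustrated with a minimal sketch. It is not the authors' formulation: the counting estimate of P(class | token), the independence assumption behind the edge weight, and every function name and toy example below are assumptions made for illustration only.

```python
from collections import Counter, defaultdict


def token_class_probs(examples):
    """Estimate P(class | token) by counting the labels of examples containing the token.

    Assumption: a simple relative-frequency estimate stands in for however
    the paper extrapolates example-level labels to tokens.
    """
    counts = defaultdict(Counter)
    for tokens, label in examples:
        for tok in set(tokens):  # count each token once per example
            counts[tok][label] += 1
    probs = {}
    for tok, label_counts in counts.items():
        total = sum(label_counts.values())
        probs[tok] = {label: n / total for label, n in label_counts.items()}
    return probs


def same_class_weight(probs, a, b):
    """Probability that tokens a and b belong to the same class.

    Assumption: the two tokens' class memberships are treated as independent.
    """
    return sum(p * probs[b].get(label, 0.0) for label, p in probs[a].items())


def build_token_graph(examples):
    """Return a dict mapping sorted token pairs (a, b) to a same-class edge weight."""
    probs = token_class_probs(examples)
    tokens = sorted(probs)
    return {
        (a, b): same_class_weight(probs, a, b)
        for i, a in enumerate(tokens)
        for b in tokens[i + 1:]
    }


# Hypothetical toy data: (tokens, label) pairs from a related labeled task.
examples = [
    (["great", "movie"], "pos"),
    (["great", "acting"], "pos"),
    (["boring", "movie"], "neg"),
    (["boring", "plot"], "neg"),
]
graph = build_token_graph(examples)
```

In a contrastive objective, pairs with a high edge weight would be pulled together in the embedding space and pairs with a low weight pushed apart; how the weights enter the loss is left open here because the abstract does not specify it.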
We propose a technique for learning single-view 3D object pose estimation models by utilizing a new source of data -- in-the-wild videos where objects turn. Such videos are prevalent in practice (e.g., cars in roundabouts, airplanes near runways) and easy to collect. We show that classical structure-from-motion algorithms, coupled with recent advances in instance detection and feature matching, provide surprisingly accurate relative 3D pose estimation on such videos. We propose a multi-stage training scheme that first learns a canonical pose across a collection of videos and then supervises a model for single-view pose estimation. The proposed technique achieves competitive performance with respect to the existing state of the art on standard benchmarks for 3D pose estimation, without requiring any pose labels during training. We also contribute an Accidental Turntables Dataset, containing a challenging set of 41,212 images of cars with cluttered backgrounds, motion blur, and illumination changes, which serves as a benchmark for 3D pose estimation.
In this paper, we extend scene understanding to include that of human sketch. The result is a complete trilogy of scene representation from three diverse and complementary modalities -- sketch, photo, and text. Instead of learning a rigid three-way embedding and being done with it, we focus on learning a flexible joint embedding that fully supports the "optionality" that this complementarity brings. Our embedding supports optionality on two axes: (i) optionality across modalities -- use any combination of modalities as a query for downstream tasks like retrieval, and (ii) optionality across tasks -- simultaneously utilising the embedding for either discriminative (e.g., retrieval) or generative (e.g., captioning) tasks. This provides flexibility to end-users by exploiting the best of each modality, therefore serving the very purpose behind our proposal of a trilogy in the first place. First, a combination of information-bottleneck and conditional invertible neural networks disentangles the modality-specific component from the modality-agnostic one in sketch, photo, and text. Second, the modality-agnostic instances from sketch, photo, and text are synergised using a modified cross-attention. Once learned, we show our embedding can accommodate a wide range of scene-related tasks, including those enabled for the first time by the inclusion of sketch, all without any task-specific modifications.
We investigate how well CLIP understands texture in natural images described by natural language. To this end, we analyze CLIP's ability to: (1) perform zero-shot learning on various texture and material classification datasets; (2) represent compositional properties of texture, such as red dots or yellow stripes, on the Describable Texture in Detail (DTDD) dataset; and (3) aid fine-grained categorization of birds in photographs described by the color and texture of their body parts.