Translated by Google Translate
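Graph pooling is a crucial operation for encoding hierarchical structures within graphs. Most existing graph pooling approaches formulate the problem as a node clustering task, which effectively captures the graph topology. Conventional methods ask users to specify an appropriate number of clusters as a hyperparameter and then assume that all input graphs share the same number of clusters. In inductive settings where the number of clusters can vary, however, the model should be able to represent this variation in its pooling layers in order to learn suitable clusters. We therefore propose GMPool, a novel differentiable graph pooling architecture that automatically determines the appropriate number of clusters based on the input data. The main intuition involves a grouping matrix, defined as a quadratic form of the pooling operator, which induces the use of binary classification probabilities over pairwise combinations of nodes. GMPool first computes the grouping matrix and then decomposes it. Extensive evaluations on molecular property prediction tasks demonstrate that our method outperforms conventional methods.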
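The relationship between a pooling operator and its grouping matrix can be illustrated with a small sketch. It assumes a hard 0/1 assignment matrix `S` (nodes by clusters) and takes the quadratic form to be `M = S Sᵀ`, so that `M[i][j]` is 1 exactly when nodes `i` and `j` belong to the same cluster. The abstract does not say which decomposition GMPool uses; grouping nodes with identical rows is a simplified stand-in chosen here for illustration, and the function names are hypothetical.

```python
def grouping_matrix(S):
    """Return M = S S^T for an n x k assignment matrix S given as a list of rows."""
    n = len(S)
    return [
        [sum(a * b for a, b in zip(S[i], S[j])) for j in range(n)]
        for i in range(n)
    ]


def clusters_from_grouping(M, threshold=0.5):
    """Recover clusters by grouping nodes whose thresholded rows of M coincide.

    This is an illustrative stand-in for a matrix decomposition, not the
    procedure used by GMPool.
    """
    groups = {}
    for node, row in enumerate(M):
        key = tuple(value > threshold for value in row)
        groups.setdefault(key, []).append(node)
    return list(groups.values())


# Five nodes, two clusters: nodes 0-1 and nodes 2-4 (toy data).
S = [[1, 0], [1, 0], [0, 1], [0, 1], [0, 1]]
M = grouping_matrix(S)
clusters = clusters_from_grouping(M)
num_clusters = len(clusters)
```

Note that the number of clusters is read off from `M` rather than supplied as a hyperparameter, which is the property the abstract emphasizes.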
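Many problems in computer vision and machine learning can be cast as learning on hypergraphs that represent higher-order relations. Recent approaches to hypergraph learning extend graph neural networks based on message passing, which is simple yet fundamentally limited in modeling long-range dependencies and in expressive power. On the other hand, tensor-based equivariant neural networks enjoy maximal expressiveness, but their application to hypergraphs has been limited by heavy computation and strict assumptions of fixed-order hyperedges. We resolve these problems and present the Equivariant Hypergraph Neural Network (EHNN), the first attempt to realize maximally expressive equivariant layers for general hypergraph learning. We also present two practical realizations based on hypernetworks (EHNN-MLP) and self-attention (EHNN-Transformer), which are easy to implement and theoretically more expressive than most message passing approaches. We demonstrate their capability on a range of hypergraph learning problems, including synthetic k-edge identification, semi-supervised classification, and visual keypoint matching, and report improved performance over strong message passing baselines. Our implementation is available at https://github.com/jw9730/ehnn.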
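Semi-supervised video object segmentation (VOS) aims to densely track certain designated objects in videos. One of the main challenges in this task is the presence of background distractors that appear similar to the target objects. We propose three novel strategies to suppress such distractors: 1) a spatio-temporally diversified template construction scheme to obtain generalized properties of the target objects; 2) a learnable distance-scoring function that excludes spatially distant distractors by exploiting the temporal consistency between two consecutive frames; 3) a swap-and-attach augmentation that forces each object to have unique features by providing training samples containing entangled objects. On all public benchmark datasets, our model achieves performance comparable to contemporary state-of-the-art approaches, even while running in real time. Qualitative results also demonstrate the superiority of our approach over existing methods. We believe our approach will be widely used in future VOS research.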
Image captioning is one of the straightforward tasks that can take advantage of large-scale web-crawled data, which provides rich knowledge about the visual world for a captioning model. However, since web-crawled data contains image-text pairs that are aligned at different levels, the inherent noise (e.g., misaligned pairs) makes it difficult to learn a precise captioning model. While a filtering strategy can effectively remove noisy data, it reduces the amount of learnable knowledge and sometimes introduces a new problem of data deficiency. To get the best of both worlds, we propose a noise-aware learning framework, which learns rich knowledge from the whole web-crawled data while being less affected by the noise. This is achieved by the proposed quality-controllable model, which is trained using the alignment levels of the image-text pairs as an additional control signal. The alignment-conditioned training allows the model to generate high-quality, well-aligned captions by simply setting the control signal to the desired alignment level at inference time. Through in-depth analysis, we show that our controllable captioning model is effective in handling noise. In addition, with two tasks, zero-shot captioning and text-to-image retrieval using generated captions (i.e., self-retrieval), we demonstrate that our model can produce high-quality captions in terms of descriptiveness and distinctiveness. Code is available at https://github.com/kakaobrain/noc.
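The idea of conditioning on an alignment level can be sketched as follows. The sketch assumes each image-text pair already carries a scalar alignment score, that scores are bucketed into discrete levels by fixed boundaries, and that the level is exposed to the model as a prefix token. The boundaries, the token format `<align_k>`, the example scores, and the function names are all hypothetical; the abstract does not specify how the control signal is encoded.

```python
from bisect import bisect_right


def alignment_level(score, boundaries):
    """Map a scalar alignment score to a discrete level (0 = least aligned)."""
    return bisect_right(boundaries, score)


def with_control_token(caption, level):
    """Prefix a caption with a (hypothetical) control token for its level."""
    return f"<align_{level}> {caption}"


# Hypothetical bucket boundaries and web-crawled pairs with toy scores.
boundaries = [0.2, 0.3]
pairs = [
    ("a dog on a beach", 0.35),
    ("buy now, free shipping", 0.12),
    ("photo of a pet", 0.25),
]

# Training: every pair is kept, but each is tagged with its own level.
training_targets = [
    with_control_token(caption, alignment_level(score, boundaries))
    for caption, score in pairs
]

# Inference: request the highest level to ask for a well-aligned caption.
inference_prefix = with_control_token("", len(boundaries)).strip()
```

Unlike filtering, no pair is discarded here: noisy pairs remain in the training set but are marked as such.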
In this paper, we propose DiffFace, the first diffusion-based face swapping framework, composed of ID-conditional DDPM training, sampling with facial guidance, and target-preserving blending. Specifically, in the training process, the ID-conditional DDPM is trained to generate face images with the desired identity. In the sampling process, we use off-the-shelf facial expert models to make the model transfer the source identity while faithfully preserving target attributes. During this process, to preserve the background of the target image and obtain the desired face swapping result, we additionally propose a target-preserving blending strategy. It helps our model protect the attributes of the target face from noise while transferring the source facial identity. In addition, without any re-training, our model can flexibly apply additional facial guidance and adaptively control the ID-attribute trade-off to achieve the desired results. To the best of our knowledge, this is the first approach that applies a diffusion model to the face swapping task. Compared with previous GAN-based approaches, DiffFace takes advantage of the diffusion model to offer benefits such as training stability, high fidelity, sample diversity, and controllability. Extensive experiments show that DiffFace is comparable or superior to state-of-the-art methods on several standard face swapping benchmarks.
Through in-context learning (ICL), large-scale language models are effective few-shot learners without additional model fine-tuning. However, ICL performance does not scale well with the number of available training samples, as it is limited by the inherent input length constraint of the underlying language model. Meanwhile, many studies have revealed that language models are also powerful feature extractors, allowing them to be used in a black-box manner and enabling the linear probing paradigm, where lightweight discriminators are trained on top of pre-extracted input representations. This paper proposes prompt-augmented linear probing (PALP), a hybrid of linear probing and ICL that leverages the best of both worlds. PALP inherits the scalability of linear probing and the ability to make language models derive more meaningful representations by tailoring the input into a more conceivable form. Through in-depth investigations on various datasets, we verify that PALP significantly enhances the input representations, closing the gap between ICL in the data-hungry scenario and fine-tuning in the data-abundant scenario with little training overhead, potentially making PALP a strong alternative in a black-box scenario.
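The general shape of linear probing on prompt-formatted inputs can be sketched with a toy example. Everything concrete here is an assumption made for illustration: the prompt template, the `extract_features` stub (a word-count vector standing in for a frozen language model's representation), the toy data, and the plain logistic-regression probe. It is not the PALP implementation.

```python
import math
import re

VOCAB = ["great", "awful", "review", "sentiment"]  # toy feature vocabulary


def apply_template(text):
    """Wrap the raw input in a (hypothetical) prompt template."""
    return f"Review: {text}\nSentiment:"


def extract_features(prompt):
    """Stub for a frozen language model: word counts over a tiny vocabulary."""
    tokens = re.findall(r"[a-z]+", prompt.lower())
    return [float(tokens.count(word)) for word in VOCAB]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_probe(features, labels, lr=0.5, epochs=200):
    """Fit a logistic-regression probe with full-batch gradient ascent."""
    weights, bias = [0.0] * len(features[0]), 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(weights), 0.0
        for x, y in zip(features, labels):
            error = y - sigmoid(sum(w * v for w, v in zip(weights, x)) + bias)
            grad_w = [g + error * v for g, v in zip(grad_w, x)]
            grad_b += error
        weights = [w + lr * g for w, g in zip(weights, grad_w)]
        bias += lr * grad_b
    return weights, bias


def predict(text, weights, bias):
    x = extract_features(apply_template(text))
    return int(sigmoid(sum(w * v for w, v in zip(weights, x)) + bias) > 0.5)


train = [("great movie", 1), ("awful plot", 0), ("great acting", 1), ("awful pacing", 0)]
X = [extract_features(apply_template(text)) for text, _ in train]
y = [label for _, label in train]
weights, bias = train_probe(X, y)
predictions = [predict("a great surprise", weights, bias), predict("truly awful", weights, bias)]
```

Only the probe's weights are trained; the feature extractor is never updated, which mirrors the black-box setting the abstract describes. Features are extracted once per example, so the training set is not bounded by the model's input length.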