There is a growing interest in developing unlearnable examples (UEs) against visual privacy leaks on the Internet. UEs are training samples added with invisible but unlearnable noise, which have been found to prevent unauthorized training of machine learning models. UEs are typically generated via a bilevel optimization framework with a surrogate model to remove (minimize) errors from the original samples, and are then applied to protect the data against unknown target models. However, existing UE generation methods all rely on an ideal assumption called label-consistency, where the hackers and protectors are assumed to hold the same label for a given sample. In this work, we propose and promote a more practical label-agnostic setting, where the hackers may exploit the protected data quite differently from the protectors. For example, an m-class unlearnable dataset held by the protector may be exploited by the hacker as an n-class dataset. Existing UE generation methods are rendered ineffective in this challenging setting. To tackle this challenge, we present a novel technique called Unlearnable Clusters (UCs) to generate label-agnostic unlearnable examples with cluster-wise perturbations. Furthermore, we propose to leverage Vision-and-Language Pre-trained Models (VLPMs) like CLIP as the surrogate model to improve the transferability of the crafted UCs to diverse domains. We empirically verify the effectiveness of our proposed approach under a variety of settings with different datasets, target models, and even commercial platforms such as Microsoft Azure and Baidu PaddlePaddle.
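To make the cluster-wise idea concrete, below is a minimal sketch of label-agnostic, per-cluster perturbations. It assumes a generic frozen feature extractor (`encoder`) standing in for the CLIP surrogate; the k-means clustering step, the linear surrogate head, and the simple error-minimizing objective over cluster ids are illustrative assumptions, not the authors' exact procedure.

```python
# Illustrative sketch of cluster-wise unlearnable perturbations (not the paper's exact method).
# Assumption: `encoder` is any frozen feature extractor mapping images (N, C, H, W) -> (N, D).
import torch
import torch.nn.functional as F
from sklearn.cluster import KMeans

def unlearnable_clusters(images, encoder, k=10, eps=8/255, steps=50, lr=0.01):
    """Assign one shared perturbation per feature cluster and optimize it to minimize a
    surrogate loss, so the perturbation (not the image content) becomes the easiest signal."""
    with torch.no_grad():
        feats = encoder(images)                                  # surrogate features (N, D)
    cluster_ids = KMeans(n_clusters=k, n_init=10).fit_predict(feats.cpu().numpy())
    cluster_ids = torch.as_tensor(cluster_ids, device=images.device)

    deltas = torch.zeros(k, *images.shape[1:], device=images.device, requires_grad=True)
    head = torch.nn.Linear(feats.shape[1], k).to(images.device)  # illustrative surrogate head
    opt = torch.optim.Adam([deltas, *head.parameters()], lr=lr)

    for _ in range(steps):
        noisy = (images + deltas[cluster_ids]).clamp(0, 1)
        logits = head(encoder(noisy))
        loss = F.cross_entropy(logits, cluster_ids)              # error-minimizing w.r.t. cluster ids
        opt.zero_grad(); loss.backward(); opt.step()
        with torch.no_grad():
            deltas.clamp_(-eps, eps)                             # keep the noise imperceptible
    return (images + deltas[cluster_ids].detach()).clamp(0, 1)
```

Because the optimization target is the cluster assignment rather than any particular label set, the same protected data stays unlearnable no matter how a hacker re-labels it.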
Although significant progress has been made in few-shot learning, most existing few-shot learning methods require supervised pre-training on a large number of base-class samples, which limits their generalization ability in real-world applications. Recently, large-scale self-supervised vision-language models (e.g., CLIP) have provided a new paradigm for transferable visual representation learning. However, these pre-trained models may neglect detailed visual information that is difficult to describe in language but important for learning an effective classifier in few-shot classification. To address the above problem, we propose a new framework, named Semantic-guided Visual Adapting (SgVA), which can effectively extend vision-language pre-trained models to produce discriminative task-specific visual features by comprehensively using a vision-specific contrastive loss, a cross-modal contrastive loss, and an implicit knowledge distillation. The implicit knowledge distillation is designed to transfer fine-grained cross-modal knowledge to guide the updating of the vision adapter. State-of-the-art results on 13 datasets demonstrate that the adapted visual features can well complement the cross-modal features to improve few-shot image classification.
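The sketch below illustrates how the three loss terms named in the abstract could be combined to train a lightweight visual adapter on top of frozen CLIP features. The adapter architecture, residual mixing, temperature, and loss weights are assumptions for illustration, not SgVA's exact design.

```python
# Hedged sketch: adapter + vision-specific contrastive, cross-modal contrastive, and
# distillation losses over frozen features. Shapes: img_feats (N, D), txt_feats (C, D).
import torch
import torch.nn.functional as F

class VisualAdapter(torch.nn.Module):
    def __init__(self, dim, hidden=256, residual=0.2):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(dim, hidden), torch.nn.ReLU(), torch.nn.Linear(hidden, dim))
        self.residual = residual

    def forward(self, v):
        # Mix the frozen feature with its adapted version, then re-normalize.
        return F.normalize(self.residual * v + (1 - self.residual) * self.net(v), dim=-1)

def sgva_style_loss(adapter, img_feats, txt_feats, labels, tau=0.07, kd_weight=1.0):
    v = adapter(F.normalize(img_feats, dim=-1))
    t = F.normalize(txt_feats, dim=-1)
    # Vision-specific supervised contrastive term: pull same-class adapted features together.
    sim_vv = v @ v.t() / tau
    mask = (labels[:, None] == labels[None, :]).float() - torch.eye(len(v), device=v.device)
    log_prob = sim_vv - torch.logsumexp(sim_vv, dim=1, keepdim=True)
    l_vis = -(mask * log_prob).sum(1) / mask.sum(1).clamp(min=1)
    # Cross-modal contrastive term: classify adapted features against class text prototypes.
    logits_vt = v @ t.t() / tau
    l_cross = F.cross_entropy(logits_vt, labels, reduction='none')
    # Distillation term: keep adapted predictions close to the frozen cross-modal predictions.
    with torch.no_grad():
        teacher = (F.normalize(img_feats, dim=-1) @ t.t() / tau).softmax(-1)
    l_kd = F.kl_div(logits_vt.log_softmax(-1), teacher, reduction='batchmean')
    return l_vis.mean() + l_cross.mean() + kd_weight * l_kd
```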
Pre-trained Vision-Language Models (VLMs) such as CLIP have shown impressive generalization capability in downstream vision tasks with appropriate text prompts. Instead of designing prompts manually, Context Optimization (CoOp) has recently been proposed to learn continuous prompts using task-specific training data. Despite the performance improvements on downstream tasks, several studies have reported that CoOp suffers from overfitting in two respects: (i) the test accuracy on base classes first improves and then degrades during training; (ii) the test accuracy on novel classes keeps decreasing. However, none of the existing studies can understand and mitigate such an overfitting problem effectively. In this paper, we first explore the cause of overfitting by analyzing the gradient flow. Comparative experiments reveal that CoOp favors generalizable and spurious features in the early and later training stages respectively, leading to the non-overfitting and overfitting phenomena. Given these observations, we propose Subspace Prompt Tuning (SubPT) to project the gradients in back-propagation onto the low-rank subspace spanned by the early-stage gradient flow eigenvectors during the entire training process, which successfully eliminates the overfitting problem. Besides, we equip CoOp with a Novel Feature Learner (NFL) to enhance the generalization of the learned prompts to novel categories beyond the training set, without requiring any image training data. Extensive experiments on 11 classification datasets demonstrate that SubPT+NFL consistently boosts the performance of CoOp and outperforms the state-of-the-art approach CoCoOp. Experiments on more challenging downstream vision tasks, including open-vocabulary object detection and zero-shot semantic segmentation, also verify the effectiveness of the proposed method. Code can be found at https://tinyurl.com/mpe64f89.
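A minimal sketch of the projection idea follows: gradients recorded in the early stage are factorized, and later gradients are projected onto the span of the dominant directions. The rank, the recording schedule, and the hook-free usage shown in the comments are assumptions, not SubPT's exact settings.

```python
# Sketch of low-rank gradient-subspace projection (the SubPT idea), under assumed hyperparameters.
import torch

class GradientSubspaceProjector:
    def __init__(self, rank=4):
        self.rank = rank
        self.history = []   # flattened gradients collected during the early training stage
        self.basis = None   # (num_params, rank) orthonormal basis of dominant gradient directions

    def record(self, grad):
        self.history.append(grad.detach().flatten().clone())

    def fit(self):
        G = torch.stack(self.history, dim=1)           # (num_params, num_recorded)
        # Left singular vectors of the early gradient matrix span the dominant directions
        # (eigenvectors of the early-stage gradient covariance).
        U, _, _ = torch.linalg.svd(G, full_matrices=False)
        self.basis = U[:, : self.rank]

    def project(self, grad):
        flat = grad.flatten()
        return (self.basis @ (self.basis.t() @ flat)).view_as(grad)

# Usage sketch: record prompt gradients for the first few epochs, then project afterwards.
# proj = GradientSubspaceProjector(rank=4)
# early epochs:  proj.record(prompt.grad)
# once:          proj.fit()
# later epochs:  prompt.grad = proj.project(prompt.grad)
```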
Digital art synthesis is receiving increasing attention in the multimedia community, as it effectively engages the public with art. Current digital art synthesis methods usually use a single-modality input as guidance, which limits the expressiveness of the model and the diversity of the generated results. To address this problem, we propose the Multimodal Guided Artwork Diffusion (MGAD) model, a diffusion-based digital artwork generation approach that utilizes multimodal prompts as guidance to control a classifier-free diffusion model. In addition, the Contrastive Language-Image Pre-training (CLIP) model is used to unify the text and image modalities. Extensive experimental results on the quality and quantity of the generated digital art paintings confirm the effectiveness of combining diffusion models with multimodal guidance. Code is available at https://github.com/haha-lisa/mgad-multimodal-guided-artwork-diffusion.
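As an illustration of how multimodal guidance can steer a classifier-free diffusion sampler, the sketch below combines the standard classifier-free extrapolation with a CLIP-similarity gradient computed from both a text prompt and an image prompt. The function handles (`unet`, `clip_image_encoder`) and the weighting scheme are placeholders, not MGAD's actual interfaces.

```python
# Hedged sketch of one denoising step with multimodal (text + image) CLIP guidance.
import torch
import torch.nn.functional as F

def guided_noise_pred(unet, x_t, t, cond, guidance_scale=7.5):
    """Standard classifier-free guidance: extrapolate away from the unconditional prediction."""
    eps_uncond = unet(x_t, t, cond=None)
    eps_cond = unet(x_t, t, cond=cond)
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

def multimodal_clip_grad(x_t, clip_image_encoder, text_embed, image_embed, w_text=1.0, w_image=1.0):
    """Gradient of a CLIP similarity objective mixing a text prompt and an image prompt.
    text_embed / image_embed are assumed to be normalized CLIP embeddings."""
    x = x_t.detach().requires_grad_(True)
    feat = F.normalize(clip_image_encoder(x), dim=-1)
    score = w_text * (feat * text_embed).sum() + w_image * (feat * image_embed).sum()
    return torch.autograd.grad(score, x)[0]

# At each sampling step, the predicted noise can be nudged by the multimodal gradient, e.g.:
# eps = guided_noise_pred(unet, x_t, t, cond)
# eps = eps - sigma_t * multimodal_clip_grad(x_t, clip_enc, txt_emb, img_emb)
```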
Most existing state-of-the-art video classification methods assume that the training data obey a uniform distribution. However, real-world video data typically exhibit an imbalanced long-tailed distribution, which biases the model toward the head classes and leads to relatively low performance on the tail classes. While current long-tailed classification methods usually focus on image classification, adapting them to video data is not a trivial extension. We propose an end-to-end multi-expert distribution calibration method to address these challenges based on two levels of distribution information. The method jointly considers the distribution of the samples within each class (intra-class distribution) and the overall distribution across classes (inter-class distribution) to tackle the problem of data imbalance under a long-tailed distribution. By modeling these two levels of distribution information, the model can consider the head and tail classes jointly and effectively transfer knowledge from the head classes to improve the performance of the tail classes. Extensive experiments verify that our method achieves state-of-the-art performance on long-tailed video classification tasks.
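To make the "two-level distribution" idea tangible, here is a generic calibration sketch: per-class (intra-class) feature statistics of tail classes are smoothed toward statistics aggregated from well-populated head classes (inter-class), and synthetic tail features are resampled from the calibrated statistics. This is an illustration of the general principle, not the paper's multi-expert formulation; the threshold, mixing weight, and helper names are assumptions.

```python
# Generic two-level distribution calibration sketch (illustrative only).
# Assumes feats: (N, D) video features, labels: (N,) class ids, and at least one head class.
import torch

def calibrate_and_sample(feats, labels, num_classes, head_threshold=100, n_samples=50, alpha=0.5):
    stats = {}
    for c in range(num_classes):
        fc = feats[labels == c]
        stats[c] = (fc.mean(0), fc.var(0, unbiased=False) + 1e-4, len(fc))  # intra-class statistics
    head = [c for c in range(num_classes) if stats[c][2] >= head_threshold]
    head_mean = torch.stack([stats[c][0] for c in head]).mean(0)            # inter-class summary
    head_var = torch.stack([stats[c][1] for c in head]).mean(0)

    synthetic = []
    for c in range(num_classes):
        mean, var, n = stats[c]
        if n >= head_threshold:
            continue                                        # head classes keep their own statistics
        mean = alpha * mean + (1 - alpha) * head_mean       # calibrate tail statistics toward the head
        var = alpha * var + (1 - alpha) * head_var
        z = torch.randn(n_samples, feats.shape[1]) * var.sqrt() + mean
        synthetic.append((z, torch.full((n_samples,), c, dtype=torch.long)))
    return synthetic
```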
Recently, cluster contrastive learning has been shown to be effective for person ReID by computing the contrastive loss between individual features and cluster memories. However, existing methods that use individual features to update the cluster memory with momentum are not robust to noisy samples, such as samples with wrongly annotated labels or pseudo labels. Unlike the individual-based updating mechanism, the centroid-based updating mechanism, which updates the cluster memory with the mean feature of each cluster, is robust to a few noisy samples. Therefore, we formulate the individual-based and centroid-based updating mechanisms in a unified cluster contrastive framework, named Dual Cluster Contrastive learning (DCC), which maintains two types of memory banks: an individual cluster memory and a centroid cluster memory. Notably, the individual cluster memory is updated with individual features, while the centroid cluster memory is updated with the mean feature of each cluster. Besides the vanilla contrastive loss for each memory, a consistency constraint is applied to guarantee the consistency of the outputs of the two memories. Note that DCC can be easily applied to unsupervised or supervised person ReID by using either pseudo labels generated by a clustering method or ground-truth labels. Extensive experiments on two benchmarks under both supervised and unsupervised person ReID demonstrate the superiority of the proposed DCC. Code is available at: https://github.com/htyao89/dual-cluster-contrastive/
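A minimal sketch of the dual memory banks described above follows: one bank is updated sample by sample, the other with cluster centroids, and a consistency term ties their outputs together. The momentum value, temperature, and the KL form of the consistency constraint are assumptions for illustration.

```python
# Hedged sketch of dual (individual + centroid) cluster memories for contrastive ReID training.
import torch
import torch.nn.functional as F

class DualClusterMemory:
    def __init__(self, num_clusters, dim, momentum=0.2, tau=0.05):
        self.ind_mem = F.normalize(torch.randn(num_clusters, dim), dim=1)  # individual memory
        self.cen_mem = self.ind_mem.clone()                                # centroid memory
        self.m, self.tau = momentum, tau

    def loss(self, feats, labels):
        feats = F.normalize(feats, dim=1)
        logits_ind = feats @ self.ind_mem.t() / self.tau
        logits_cen = feats @ self.cen_mem.t() / self.tau
        l_con = F.cross_entropy(logits_ind, labels) + F.cross_entropy(logits_cen, labels)
        # Consistency constraint: the two memories should score samples similarly.
        l_cons = F.kl_div(logits_ind.log_softmax(1), logits_cen.softmax(1), reduction='batchmean')
        return l_con + l_cons

    @torch.no_grad()
    def update(self, feats, labels):
        feats = F.normalize(feats, dim=1)
        for c in labels.unique():
            fc = feats[labels == c]
            # Individual-based update: every sample nudges its cluster slot in turn.
            for f in fc:
                self.ind_mem[c] = F.normalize(self.m * self.ind_mem[c] + (1 - self.m) * f, dim=0)
            # Centroid-based update: the batch centroid nudges the slot, robust to a few outliers.
            self.cen_mem[c] = F.normalize(self.m * self.cen_mem[c] + (1 - self.m) * fc.mean(0), dim=0)
```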
Pedestrian trajectory prediction is an important technology for autonomous driving and has become a research hotspot in recent years. Previous methods mainly rely on the positional relationships among pedestrians to model social interactions, which is clearly insufficient to represent the complex cases that arise in real situations. In addition, most existing works usually introduce the scene interaction module as an independent branch and embed the social interaction features into the trajectory generation process, rather than performing social interaction and scene interaction simultaneously, which may undermine the rationality of trajectory prediction. In this paper, we propose a new prediction model named Social Soft Attention Graph Convolutional Network (SSAGCN), which aims to simultaneously handle the social interactions among pedestrians and the scene interactions between pedestrians and the environment. In detail, when modeling social interactions, we propose a new social soft attention function that fully considers the various interaction factors among pedestrians and can distinguish the influence of the pedestrians around an agent according to different factors in various situations. For the physical interaction, we propose a new sequential scene sharing mechanism: the influence of the scene on one agent at each moment can be shared with other neighbors through the social soft attention, so the influence of the scene is expanded in both the spatial and temporal dimensions. With the help of these improvements, we successfully obtain socially and physically acceptable predicted trajectories. Experiments on publicly available datasets demonstrate the effectiveness of SSAGCN, which achieves state-of-the-art results.
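The toy sketch below shows one way a "social soft attention" adjacency could be built from pedestrian states and plugged into a graph-convolution step. The actual interaction factors used by SSAGCN are richer than this distance/approach score; all functions here are illustrative.

```python
# Illustrative sketch: soft-attention adjacency from positions/velocities + one GCN step.
import torch
import torch.nn.functional as F

def social_soft_attention(pos, vel):
    """pos, vel: (N, 2) positions and velocities of N pedestrians at one time step.
    Returns an (N, N) row-normalized attention matrix."""
    diff = pos[None, :, :] - pos[:, None, :]             # relative displacement i -> j
    dist = diff.norm(dim=-1) + 1e-6
    approach = (vel[:, None, :] * diff).sum(-1) / dist   # how fast i is closing in on j
    score = -dist + approach                             # nearer / approaching neighbors weigh more
    score.fill_diagonal_(float('-inf'))                  # no self-attention
    return F.softmax(score, dim=1)

def gcn_step(feats, attn, weight):
    """One graph-convolution step: aggregate neighbor features with the soft attention."""
    return F.relu(attn @ feats @ weight)

# pos, vel = torch.randn(5, 2), torch.randn(5, 2)
# attn = social_soft_attention(pos, vel)
# out = gcn_step(torch.randn(5, 16), attn, torch.randn(16, 16))
```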
Graph neural networks (GNNs) have succeeded in processing graph data by extracting and propagating structure-aware features. Existing GNN research has designed various propagation schemes to guide the aggregation of neighbor information. Recently, the field has advanced from local propagation schemes, which focus on direct neighbors, toward extended propagation schemes that can directly handle extended neighbors consisting of both local and high-order neighbors. Despite the impressive performance, existing approaches are still insufficient for building an efficient and learnable extended propagation scheme that can adaptively adjust the influence of local and high-order neighbors. This paper proposes an efficient yet effective end-to-end framework, namely Contrastive Adaptive Propagation Graph Neural Networks (CAPGNN), to address these issues by combining Personalized PageRank with attention techniques. CAPGNN models the learnable extended propagation scheme with a polynomial of a sparse local affinity matrix, where the polynomial relies on Personalized PageRank to provide superior initial coefficients. To adaptively adjust the influence of local and high-order neighbors, a coefficient-attention model is introduced to learn to adjust the coefficients of the polynomial. In addition, we leverage self-supervised learning techniques and design a negative-free entropy-aware contrastive loss to explicitly exploit unlabeled training data. We implement CAPGNN as two different versions, named CAPGCN and CAPGAT, which use static and dynamic sparse local affinity matrices, respectively. Experiments on graph benchmark datasets show that CAPGNN consistently outperforms or matches the state-of-the-art baselines. The source code is publicly available at https://github.com/hujunxianligong/capgnn.
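The core propagation idea admits a compact sketch: feature propagation as a learnable polynomial of the affinity matrix whose coefficients start from Personalized-PageRank weights. The coefficient-attention module and the contrastive loss of CAPGNN are omitted here; the polynomial order and teleport probability are assumed values.

```python
# Sketch of extended propagation as a PPR-initialized, learnable polynomial of the affinity matrix.
import torch

class PolynomialPropagation(torch.nn.Module):
    def __init__(self, order=10, alpha=0.1):
        super().__init__()
        # PPR-style initialization: alpha * (1 - alpha)^k for hop k.
        init = alpha * (1 - alpha) ** torch.arange(order + 1, dtype=torch.float)
        self.coeffs = torch.nn.Parameter(init)   # learnable, so hop influence can be re-weighted

    def forward(self, adj, x):
        """adj: (sparse) normalized affinity matrix (N, N); x: node features (N, D)."""
        out = self.coeffs[0] * x
        h = x
        for k in range(1, len(self.coeffs)):
            h = torch.sparse.mm(adj, h) if adj.is_sparse else adj @ h
            out = out + self.coeffs[k] * h       # accumulate the k-hop contribution
        return out
```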
We address the task of weakly-supervised video object grounding (WSVOG), where only video-sentence annotations are available during model learning. It aims to localize the objects described in a sentence to visual regions in the video, which is a fundamental capability needed in pattern analysis and machine learning. Despite recent progress, existing methods all suffer from the severe problem of spurious associations, which harms the grounding performance. In this paper, starting from the definition of WSVOG, we pinpoint the spurious associations from two aspects: (1) the associations themselves are not object-relevant but extremely ambiguous due to the weak supervision, and (2) the associations are unavoidably confounded by observational bias when the statistics-based matching strategy of existing methods is adopted. With this in mind, we design a unified causal framework to learn deconfounded object-relevant associations for more accurate and robust video object grounding. Specifically, we learn the object-relevant associations through causal intervention from the perspective of the video data generation process. To overcome the lack of fine-grained supervision for the intervention, we propose a novel spatial adversarial contrastive learning paradigm. To further remove the accompanying confounding effect within the object-relevant associations, we pursue the true causality via causal intervention with backdoor adjustment. Finally, the deconfounded object-relevant associations are learned and optimized end-to-end under the unified causal framework. Extensive experiments on both IID and OOD testing sets of three benchmarks demonstrate its accurate and robust grounding performance compared with the state-of-the-art methods.
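For readers unfamiliar with the intervention being referenced, the backdoor adjustment has the textbook form below; mapping X to the video-sentence input, Y to the grounding prediction, and Z to the confounder set is the paper's setting in spirit, while the specific instantiation of Z is particular to WSVOG.

```latex
P\big(Y \mid \mathrm{do}(X)\big) \;=\; \sum_{z} P\big(Y \mid X,\, Z=z\big)\, P\big(Z=z\big)
```

Intuitively, conditioning on each confounder value z and averaging over its prior removes the observational bias that a purely statistics-based matching strategy would absorb.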
In this paper, we present GRecX, an open-source TensorFlow framework for benchmarking GNN-based recommendation models in an efficient and unified way. GRecX consists of core libraries for building GNN-based recommendation benchmarks, as well as implementations of popular GNN-based recommendation models. The core libraries provide essential components for building efficient and unified benchmarks, including FastMetrics (an efficient metric computation library), VectorSearch (an efficient similarity search library for dense vectors), BatchEval (an efficient mini-batch evaluation library), and DataManager (a unified dataset management library). In particular, to provide a unified benchmark for the fair comparison of different complex GNN-based recommendation models, we design a new metric, GRMF-X, and integrate it into the FastMetrics component. Based on the TensorFlow GNN library tf_geometric, GRecX carefully implements a variety of popular GNN-based recommendation models. We carefully implement these baseline models to reproduce the performance reported in the literature, and our implementations are typically more efficient and user-friendly. In summary, GRecX enables users to train and benchmark GNN-based recommendation baselines in an efficient and unified way. We conduct experiments with GRecX, and the experimental results show that it allows us to train and benchmark GNN-based recommendation baselines both efficiently and consistently. The source code of GRecX is available at https://github.com/maenzhier/grecx.
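As a flavor of what a unified evaluation component standardizes, here is a generic sketch of mini-batch top-K ranking evaluation (Recall@K and NDCG@K). This is not GRecX's actual API; the function name and signature are hypothetical and shown only to illustrate the kind of computation libraries like FastMetrics and BatchEval make consistent across models.

```python
# Generic top-K ranking evaluation sketch (hypothetical helper, not part of GRecX).
import numpy as np

def recall_ndcg_at_k(scores, ground_truth, k=20):
    """scores: (num_users, num_items) predicted scores; ground_truth: list of item-id sets."""
    topk = np.argsort(-scores, axis=1)[:, :k]
    recalls, ndcgs = [], []
    for user, items in enumerate(topk):
        truth = ground_truth[user]
        if not truth:
            continue                                  # skip users without test interactions
        hits = np.isin(items, list(truth))
        recalls.append(hits.sum() / min(len(truth), k))
        dcg = (hits / np.log2(np.arange(2, k + 2))).sum()
        idcg = (1.0 / np.log2(np.arange(2, min(len(truth), k) + 2))).sum()
        ndcgs.append(dcg / idcg)
    return float(np.mean(recalls)), float(np.mean(ndcgs))
```

Fixing one such implementation for every baseline is what makes cross-model comparisons fair, since small differences in metric code can otherwise dominate the reported gaps.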