Text clustering and topic extraction are two important tasks in text mining. Usually, these two tasks are performed separately. For topic extraction to facilitate clustering, we can first project texts into a topic space and then perform a clustering algorithm to obtain clusters. To promote topic extraction by clustering, we can first obtain clusters with a clustering algorithm and then extract cluster-specific topics. However, this naive strategy ignores the fact that text clustering and topic extraction are strongly correlated and follow a chicken-and-egg relationship. Performing them separately fails to make them mutually benefit each other to achieve the best overall performance. In this paper, we propose an unsupervised text clustering and topic extraction framework (ClusTop) which integrates text clustering and topic extraction into a unified framework and can achieve high-quality clustering result and extract topics from each cluster simultaneously. Our framework includes four components: enhanced language model training, dimensionality reduction, clustering and topic extraction, where the enhanced language model can be viewed as a bridge between clustering and topic extraction. On one hand, it provides text embeddings with a strong cluster structure which facilitates effective text clustering; on the other hand, it pays high attention on the topic related words for topic extraction because of its self-attention architecture. Moreover, the training of enhanced language model is unsupervised. Experiments on two datasets demonstrate the effectiveness of our framework and provide benchmarks for different model combinations in this framework.
translated by 谷歌翻译
尽管基于3D点云表示的基于自我监督的对比度学习模型最近取得了成功,但此类预训练模型的对抗性鲁棒性引起了人们的关注。对抗性对比学习(ACL)被认为是改善预训练模型的鲁棒性的有效方法。相比之下,投影仪被认为是在对比度预处理过程中删除不必要的特征信息的有效组成部分,并且大多数ACL作品还使用对比度损失,与预测的功能表示形式相比损失,在预处理中产生对抗性示例,而“未转移”的功能表征用于发电的对抗性输入。在推理期间。由于投影和“未投影”功能之间的分布差距,其模型受到限制,以获取下游任务的可靠特征表示。我们介绍了一种新方法,通过利用虚拟对抗性损失在对比度学习框架中使用“未重新注射”功能表示,以生成高质量的3D对抗示例,以进行对抗训练。我们介绍了强大的意识损失功能,以对抗自我监督对比度学习框架。此外,我们发现选择具有正常操作员(DON)操作员差异的高差异作为对抗性自学对比度学习的附加输入,可以显着提高预训练模型的对抗性鲁棒性。我们在下游任务上验证我们的方法,包括3D分类和使用多个数据集的3D分割。它在最先进的对抗性学习方法上获得了可比的鲁棒精度。
translated by 谷歌翻译
我们解决了一对图像之间找到密集的视觉对应关系的重要任务。由于各种因素,例如质地差,重复的模式,照明变化和运动模糊,这是一个具有挑战性的问题。与使用密集信号基础真相作为本地功能匹配培训的直接监督的方法相反,我们训练3DG-STFM:一种多模式匹配模型(教师),以在3D密集的对应性监督下执行深度一致性,并将知识转移到2D单峰匹配模型(学生)。教师和学生模型均由两个基于变压器的匹配模块组成,这些模块以粗略的方式获得密集的对应关系。教师模型指导学生模型学习RGB诱导的深度信息,以实现粗糙和精细分支的匹配目的。我们还在模型压缩任务上评估了3DG-STFM。据我们所知,3DG-STFM是第一种用于本地功能匹配任务的学生教师学习方法。该实验表明,我们的方法优于室内和室外摄像头姿势估计以及同型估计问题的最先进方法。代码可在以下网址获得:https://github.com/ryan-prime/3dg-stfm。
translated by 谷歌翻译
我们研究机器学习中的\ emph {分类器derandomization}的问题:给定一个随机二进制分类器$ f:x \ to [0,1] $,示例确定性分类器$ \ hat {f} ,1 \} $在任何数据分发上近似$ f $的输出。最近的工作揭示了如何有效地降低具有强大输出近似保证的随机分类器,但以个人公平为代价 - 也就是说,如果$ f $处理过类似的输入,则$ \ hat {f} $没有。在本文中,我们启动了对分类器衍生物的系统研究,并提供了公平保证。我们表明,先前的降低方法几乎是最大的度量 - ``随机阈值''的简单``derandomization''可实现最佳公平性,但输出近似较弱。然后,我们设计了一个降低的程序,该程序在这两个之间提供了一个有吸引力的权衡:如果$ f $是$ \ alpha $ - metric博览会,根据度量$ d $,带有局部敏感的哈希(LSH)家族,则是我们的贬低$ \ \ \ \ \ \ \ \ \ hat {f} $具有很高的概率,$ o(\ alpha)$ - 公平级别和$ f $的近似值。我们还证明了适用于所有(公平和不公平的)分类器降低程序的通用结果,包括偏置方差分解和各种度量公平概念之间的降低。
translated by 谷歌翻译
Given an algorithmic predictor that is "fair" on some source distribution, will it still be fair on an unknown target distribution that differs from the source within some bound? In this paper, we study the transferability of statistical group fairness for machine learning predictors (i.e., classifiers or regressors) subject to bounded distribution shifts. Such shifts may be introduced by initial training data uncertainties, user adaptation to a deployed predictor, dynamic environments, or the use of pre-trained models in new settings. Herein, we develop a bound that characterizes such transferability, flagging potentially inappropriate deployments of machine learning for socially consequential tasks. We first develop a framework for bounding violations of statistical fairness subject to distribution shift, formulating a generic upper bound for transferred fairness violations as our primary result. We then develop bounds for specific worked examples, focusing on two commonly used fairness definitions (i.e., demographic parity and equalized odds) and two classes of distribution shift (i.e., covariate shift and label shift). Finally, we compare our theoretical bounds to deterministic models of distribution shift and against real-world data, finding that we are able to estimate fairness violation bounds in practice, even when simplifying assumptions are only approximately satisfied.
translated by 谷歌翻译
本文研究了当人类决策受试者对部署的机器学习模型做出反应时的转让性。在我们的设置中,代理或用户对应于从分发$ \ Mathcal {d} $中绘制的示例$(x,y)$,并将面对型号$ h $,其分类结果$ h(x)$。代理商可以修改$ x $以适应$ h $,这将导致$(x,y)$的分销变化。因此,当培训$ H $时,学习者将需要考虑部署输出模型时随后的``诱发''分布。我们的表述是由部署的机器学习模型与人类代理相互作用的应用程序的动机,并最终将面临响应式和交互式数据分布。我们通过研究如何在可用源分布(数据)上训练的模型将模型的可传递性进行正式讨论,将转化为诱导域的性能。由于诱导的域移位,我们为性能差距提供了上限,以及分类器必须在源训练分布或诱导的目标分布上遭受的权衡方面的下限。我们为两个流行的域适应设置提供了进一步的实例化分析,并具有协变量转移和目标转移。
translated by 谷歌翻译
在进化多目标优化领域,决策者(DM)涉及相互冲突的目标。在现实世界中,通常存在多个DM,每个DM都涉及这些目标的一部分。提出了多方多目标优化问题(MPMOPS)来描绘拖把,其中涉及多个决策者,每个方都关注所有目标的某些目标。但是,在进化计算字段中,对mpmops的关注不多。本文基于距离最小化问题(DMP)构建了一系列MPMOP,它们的Pareto最佳解决方案可以生动地可视化。为了解决MPMOPS,新提出的算法OPTMPNDS3使用多方初始化方法来初始化总体,并带Jade2操作员生成后代。在问题套件上,将OPTMPNDS3与Optall,OptMPND和OptMPNDS2进行了比较。结果表明OPTMPNDS3与其他算法具有很强的可比性
translated by 谷歌翻译
人工神经网络(ANN)训练景观的非凸起带来了固有的优化困难。虽然传统的背传播随机梯度下降(SGD)算法及其变体在某些情况下是有效的,但它们可以陷入杂散的局部最小值,并且对初始化和普通公共表敏感。最近的工作表明,随着Relu激活的ANN的培训可以重新重整为凸面计划,使希望能够全局优化可解释的ANN。然而,天真地解决凸训练制剂具有指数复杂性,甚至近似启发式需要立方时间。在这项工作中,我们描述了这种近似的质量,并开发了两个有效的算法,这些算法通过全球收敛保证培训。第一算法基于乘法器(ADMM)的交替方向方法。它解决了精确的凸形配方和近似对应物。实现线性全局收敛,并且初始几次迭代通常会产生具有高预测精度的解决方案。求解近似配方时,每次迭代时间复杂度是二次的。基于“采样凸面”理论的第二种算法更简单地实现。它解决了不受约束的凸形制剂,并收敛到大约全球最佳的分类器。当考虑对抗性培训时,ANN训练景观的非凸起加剧了。我们将稳健的凸优化理论应用于凸训练,开发凸起的凸起制剂,培训Anns对抗对抗投入。我们的分析明确地关注一个隐藏层完全连接的ANN,但可以扩展到更复杂的体系结构。
translated by 谷歌翻译
Deep learning models can achieve high accuracy when trained on large amounts of labeled data. However, real-world scenarios often involve several challenges: Training data may become available in installments, may originate from multiple different domains, and may not contain labels for training. Certain settings, for instance medical applications, often involve further restrictions that prohibit retention of previously seen data due to privacy regulations. In this work, to address such challenges, we study unsupervised segmentation in continual learning scenarios that involve domain shift. To that end, we introduce GarDA (Generative Appearance Replay for continual Domain Adaptation), a generative-replay based approach that can adapt a segmentation model sequentially to new domains with unlabeled data. In contrast to single-step unsupervised domain adaptation (UDA), continual adaptation to a sequence of domains enables leveraging and consolidation of information from multiple domains. Unlike previous approaches in incremental UDA, our method does not require access to previously seen data, making it applicable in many practical scenarios. We evaluate GarDA on two datasets with different organs and modalities, where it substantially outperforms existing techniques.
translated by 谷歌翻译
The development of social media user stance detection and bot detection methods rely heavily on large-scale and high-quality benchmarks. However, in addition to low annotation quality, existing benchmarks generally have incomplete user relationships, suppressing graph-based account detection research. To address these issues, we propose a Multi-Relational Graph-Based Twitter Account Detection Benchmark (MGTAB), the first standardized graph-based benchmark for account detection. To our knowledge, MGTAB was built based on the largest original data in the field, with over 1.55 million users and 130 million tweets. MGTAB contains 10,199 expert-annotated users and 7 types of relationships, ensuring high-quality annotation and diversified relations. In MGTAB, we extracted the 20 user property features with the greatest information gain and user tweet features as the user features. In addition, we performed a thorough evaluation of MGTAB and other public datasets. Our experiments found that graph-based approaches are generally more effective than feature-based approaches and perform better when introducing multiple relations. By analyzing experiment results, we identify effective approaches for account detection and provide potential future research directions in this field. Our benchmark and standardized evaluation procedures are freely available at: https://github.com/GraphDetec/MGTAB.
translated by 谷歌翻译