Text clustering and topic extraction are two important tasks in text mining. Usually, these two tasks are performed separately. For topic extraction to facilitate clustering, we can first project texts into a topic space and then run a clustering algorithm to obtain clusters. For clustering to facilitate topic extraction, we can first obtain clusters with a clustering algorithm and then extract cluster-specific topics. However, this naive strategy ignores the fact that text clustering and topic extraction are strongly correlated and follow a chicken-and-egg relationship. Performing them separately prevents them from benefiting each other and from achieving the best overall performance. In this paper, we propose an unsupervised text clustering and topic extraction framework (ClusTop) that integrates text clustering and topic extraction into a unified framework and can simultaneously achieve high-quality clustering results and extract topics from each cluster. Our framework includes four components: enhanced language model training, dimensionality reduction, clustering, and topic extraction, where the enhanced language model can be viewed as a bridge between clustering and topic extraction. On one hand, it provides text embeddings with a strong cluster structure, which facilitates effective text clustering; on the other hand, it pays high attention to topic-related words for topic extraction because of its self-attention architecture. Moreover, the training of the enhanced language model is unsupervised. Experiments on two datasets demonstrate the effectiveness of our framework and provide benchmarks for different model combinations in this framework.
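We study the problem of \emph{classifier derandomization} in machine learning: given a stochastic binary classifier $f: X \to [0,1]$, sample a deterministic classifier $\hat{f}: X \to \{0,1\}$ that approximates the output of $f$ over any data distribution. Recent work revealed how to efficiently derandomize a stochastic classifier with strong output approximation guarantees, but at the cost of individual fairness: if $f$ treated similar inputs similarly, $\hat{f}$ did not. In this paper, we initiate a systematic study of classifier derandomization with fairness guarantees. We show that the prior derandomization approach is almost maximally metric-unfair, and that a simple ``random threshold'' derandomization achieves optimal fairness but with weaker output approximation. We then devise a derandomization procedure that provides an appealing tradeoff between these two: if $f$ is $\alpha$-metric fair according to a metric $d$ with a locality-sensitive hash (LSH) family, then our derandomized $\hat{f}$ is, with high probability, $O(\alpha)$-metric fair and a close approximation of $f$. We also prove generic results applicable to all (fair and unfair) classifier derandomization procedures, including a bias-variance decomposition and reductions between various notions of metric fairness.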
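As an illustration of the ``random threshold'' idea mentioned in this abstract, the following is a minimal sketch under one natural reading: a single threshold $t$ is drawn uniformly from $[0,1)$ and the deterministic classifier outputs 1 exactly when $f(x) > t$. This reading, the toy classifier `f`, and the helper names are assumptions made for illustration, not the paper's stated construction. Under this reading, two inputs receive different labels only when $t$ falls between their scores, so the disagreement probability over the draw of $t$ equals $|f(x) - f(x')|$.

```python
import random


def random_threshold_derandomize(f, rng):
    """Draw one threshold t ~ Uniform[0, 1) and return a deterministic classifier.

    Assumed reading of "random threshold" derandomization: f_hat(x) = 1 iff f(x) > t.
    """
    t = rng.random()

    def f_hat(x):
        return 1 if f(x) > t else 0

    return f_hat


def f(x):
    # Hypothetical stochastic classifier: the probability of label 1, clipped to [0, 1].
    return min(1.0, max(0.0, x))


def disagreement_rate(f, x, x_prime, trials=20000, seed=0):
    """Estimate how often f_hat labels x and x_prime differently over random thresholds."""
    rng = random.Random(seed)
    disagreements = 0
    for _ in range(trials):
        f_hat = random_threshold_derandomize(f, rng)
        disagreements += f_hat(x) != f_hat(x_prime)
    return disagreements / trials


if __name__ == "__main__":
    # The estimate should be close to |f(0.30) - f(0.35)| = 0.05.
    print(disagreement_rate(f, 0.30, 0.35))
```

Each sampled `f_hat` is fully deterministic once `t` is fixed; the randomness lives only in the one-time draw of the threshold.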
Given an algorithmic predictor that is "fair" on some source distribution, will it still be fair on an unknown target distribution that differs from the source within some bound? In this paper, we study the transferability of statistical group fairness for machine learning predictors (i.e., classifiers or regressors) subject to bounded distribution shifts. Such shifts may be introduced by initial training data uncertainties, user adaptation to a deployed predictor, dynamic environments, or the use of pre-trained models in new settings. Herein, we develop a bound that characterizes such transferability, flagging potentially inappropriate deployments of machine learning for socially consequential tasks. We first develop a framework for bounding violations of statistical fairness subject to distribution shift, formulating a generic upper bound for transferred fairness violations as our primary result. We then develop bounds for specific worked examples, focusing on two commonly used fairness definitions (i.e., demographic parity and equalized odds) and two classes of distribution shift (i.e., covariate shift and label shift). Finally, we compare our theoretical bounds to deterministic models of distribution shift and against real-world data, finding that we are able to estimate fairness violation bounds in practice, even when simplifying assumptions are only approximately satisfied.
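To make the question in this abstract concrete, here is a minimal sketch that measures a demographic parity violation for one fixed predictor on a source sample and on a shifted target sample. The threshold predictor `h`, the group labels, and the toy data are hypothetical; the sketch only computes the empirical gap and does not implement the paper's bounds.

```python
def positive_rate(predictions, groups, group):
    """Fraction of positive predictions among members of one group."""
    selected = [p for p, g in zip(predictions, groups) if g == group]
    if not selected:
        raise ValueError(f"no samples for group {group!r}")
    return sum(selected) / len(selected)


def demographic_parity_gap(predictions, groups):
    """Largest difference in positive-prediction rate between any two groups."""
    rates = [positive_rate(predictions, groups, g) for g in sorted(set(groups))]
    return max(rates) - min(rates)


def h(x):
    # Hypothetical fixed predictor: accept when the score is at least 0.5.
    return 1 if x >= 0.5 else 0


# Toy (feature, group) samples; the target features are shifted per group.
source = [(0.2, "a"), (0.8, "a"), (0.3, "b"), (0.7, "b")]
target = [(0.6, "a"), (0.8, "a"), (0.9, "a"), (0.4, "a"),
          (0.1, "b"), (0.2, "b"), (0.7, "b"), (0.3, "b")]


def gap_on(sample):
    predictions = [h(x) for x, _ in sample]
    groups = [g for _, g in sample]
    return demographic_parity_gap(predictions, groups)


if __name__ == "__main__":
    print(gap_on(source))  # 0.0: both groups are accepted at rate 0.5
    print(gap_on(target))  # 0.5: rates are 0.75 for group a and 0.25 for group b
```

The predictor is unchanged between the two samples; only the feature distribution within each group moves, which is enough to turn a zero gap into a large one.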
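This paper studies model transferability when human decision subjects respond to a deployed machine learning model. In our setting, an agent or a user corresponds to a sample $(X,Y)$ drawn from a distribution $\mathcal{D}$ and will face a model $h$ and its classification result $h(X)$. Agents can modify $X$ to adapt to $h$, which induces a distribution shift on $(X,Y)$. Therefore, when training $h$, the learner needs to consider the subsequently ``induced'' distribution that arises when the output model is deployed. Our formulation is motivated by applications where deployed machine learning models interact with human agents and will ultimately face responsive and interactive data distributions. We formalize the discussion of model transferability by studying how the performance of a model trained on the available source distribution (data) translates to the induced domain. We provide upper bounds on the performance gap due to the induced domain shift, as well as lower bounds on the trade-offs that a classifier has to suffer on either the source training distribution or the induced target distribution. We provide further instantiated analysis for two popular domain adaptation settings: covariate shift and target shift.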
Deep learning models can achieve high accuracy when trained on large amounts of labeled data. However, real-world scenarios often involve several challenges: Training data may become available in installments, may originate from multiple different domains, and may not contain labels for training. Certain settings, for instance medical applications, often involve further restrictions that prohibit retention of previously seen data due to privacy regulations. In this work, to address such challenges, we study unsupervised segmentation in continual learning scenarios that involve domain shift. To that end, we introduce GarDA (Generative Appearance Replay for continual Domain Adaptation), a generative-replay based approach that can adapt a segmentation model sequentially to new domains with unlabeled data. In contrast to single-step unsupervised domain adaptation (UDA), continual adaptation to a sequence of domains enables leveraging and consolidation of information from multiple domains. Unlike previous approaches in incremental UDA, our method does not require access to previously seen data, making it applicable in many practical scenarios. We evaluate GarDA on two datasets with different organs and modalities, where it substantially outperforms existing techniques.
The development of social media user stance detection and bot detection methods relies heavily on large-scale, high-quality benchmarks. However, in addition to low annotation quality, existing benchmarks generally have incomplete user relationships, which hampers graph-based account detection research. To address these issues, we propose a Multi-Relational Graph-Based Twitter Account Detection Benchmark (MGTAB), the first standardized graph-based benchmark for account detection. To our knowledge, MGTAB was built on the largest original data in the field, with over 1.55 million users and 130 million tweets. MGTAB contains 10,199 expert-annotated users and 7 types of relationships, ensuring high-quality annotation and diversified relations. In MGTAB, we extracted the 20 user property features with the greatest information gain, together with user tweet features, as the user features. In addition, we performed a thorough evaluation of MGTAB and other public datasets. Our experiments found that graph-based approaches are generally more effective than feature-based approaches and perform better when multiple relations are introduced. By analyzing the experimental results, we identify effective approaches for account detection and provide potential future research directions in this field. Our benchmark and standardized evaluation procedures are freely available at: https://github.com/GraphDetec/MGTAB.
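The abstract mentions selecting the user property features with the greatest information gain. As a rough sketch of what such a ranking can look like, the code below computes the standard information gain $IG(Y; X) = H(Y) - H(Y \mid X)$ for discrete features and keeps the top $k$. The feature names, labels, and the assumption that features are discrete are illustrative only; how MGTAB actually discretizes and scores its features is not stated here.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy of a label list, in bits."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def information_gain(feature_values, labels):
    """H(labels) minus the expected entropy of labels after splitting on the feature."""
    total = len(labels)
    conditional = 0.0
    for value in set(feature_values):
        subset = [y for x, y in zip(feature_values, labels) if x == value]
        conditional += len(subset) / total * entropy(subset)
    return entropy(labels) - conditional


def top_k_features(features, labels, k):
    """Names of the k features with the highest information gain."""
    scores = {name: information_gain(values, labels) for name, values in features.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Hypothetical toy data: four accounts, two binary profile features.
labels = ["bot", "bot", "human", "human"]
features = {
    "verified": [0, 0, 1, 1],    # separates the labels perfectly
    "has_avatar": [0, 1, 0, 1],  # carries no information about the label
}

if __name__ == "__main__":
    print(top_k_features(features, labels, k=1))
```

On this toy data, `verified` has an information gain of 1 bit and `has_avatar` has 0, so `verified` is ranked first.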