Explainable AI (XAI) is an important developing area but remains relatively understudied for clustering. We propose an explainable-by-design clustering approach that not only finds clusters but also exemplars to explain each cluster. The use of exemplars for understanding is supported by the exemplar-based school of concept definition in psychology. We show that finding a small set of exemplars to explain even a single cluster is computationally intractable; hence, the overall problem is challenging. We develop an approximation algorithm that provides provable performance guarantees with respect to clustering quality as well as the number of exemplars used. This basic algorithm explains all the instances in every cluster, whilst another approximation algorithm uses a bounded number of exemplars to allow simpler explanations and provably covers a large fraction of all the instances. Experimental results show that our work is useful in domains involving difficult-to-understand deep embeddings of images and text.
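To make the covering flavour of the problem concrete, here is a minimal greedy max-coverage sketch for explaining a single cluster; the radius `tau` under which an exemplar is taken to "explain" a point is a hypothetical parameter, and this is an illustration rather than the paper's algorithm:

```python
import numpy as np

def greedy_exemplars(points, tau, max_exemplars=None):
    """Greedily pick cluster members as exemplars until every point lies
    within distance tau of some chosen exemplar (classic max-coverage)."""
    n = len(points)
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    in_ball = dist <= tau                # in_ball[i, j]: exemplar i covers point j
    covered = np.zeros(n, dtype=bool)
    exemplars = []
    while not covered.all():
        if max_exemplars is not None and len(exemplars) >= max_exemplars:
            break                        # bounded-exemplar variant: stop early
        gains = (in_ball & ~covered[None, :]).sum(axis=1)
        best = int(np.argmax(gains))
        exemplars.append(best)
        covered |= in_ball[best]
    return exemplars, covered.mean()     # chosen ids, fraction of points covered
```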
k-median and k-means are the two most popular objectives for clustering algorithms. Despite intensive effort, a good understanding of the approximability of these objectives, particularly in $\ell_p$-metrics, remains a major open problem. In this paper, we significantly improve upon the hardness-of-approximation factors known in the literature for these objectives in $\ell_p$-metrics. We introduce a new hypothesis called the Johnson Coverage Hypothesis (JCH), which roughly asserts that the well-studied Max k-Coverage problem on set systems is hard to approximate to a factor greater than $1-1/e$, even when the membership graph of the set system is a subgraph of the Johnson graph. We then show that together with generalizations of the embedding techniques introduced by Cohen-Addad and Karthik (FOCS'19), JCH implies hardness-of-approximation results for k-median and k-means in $\ell_p$-metrics for factors close to the ones obtained for general metrics. In particular, assuming JCH we show that it is hard to approximate the k-means objective: $\bullet$ Discrete case: to a factor of 3.94 in the $\ell_1$-metric and to a factor of 1.73 in the $\ell_2$-metric; this improves upon the previous factors of 1.56 and 1.17, respectively, obtained under UGC. $\bullet$ Continuous case: to a factor of 2.10 in the $\ell_1$-metric and to a factor of 1.36 in the $\ell_2$-metric; this improves upon the previous factor of 1.07 in the $\ell_2$-metric obtained under UGC. We also obtain similar improvements under JCH for the k-median objective. Moreover, we prove a weak version of JCH using the work of Dinur et al. (SICOMP'05) on Hypergraph Vertex Cover, and recover all the results of Cohen-Addad and Karthik (FOCS'19) stated above to (nearly) the same inapproximability factors, but now under the standard NP $\neq$ P assumption (instead of UGC).
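For reference, a standard formulation of the two objectives (not quoted from the paper): in the discrete case the centers $C$ must come from a given finite candidate set, while in the continuous case they may be arbitrary points of $\mathbb{R}^d$.

```latex
\[
  \mathrm{cost}_{k\text{-median}}(C) \;=\; \sum_{x \in P} \min_{c \in C} \lVert x - c \rVert_p,
  \qquad
  \mathrm{cost}_{k\text{-means}}(C) \;=\; \sum_{x \in P} \min_{c \in C} \lVert x - c \rVert_p^{2},
  \qquad |C| = k.
\]
```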
Graph clustering is a fundamental problem in unsupervised learning, with numerous applications in computer science and in analysing real-world data. In many real-world applications, we find that the clusters have a significant high-level structure. This is often overlooked in the design and analysis of graph clustering algorithms which make strong simplifying assumptions about the structure of the graph. This thesis addresses the natural question of whether the structure of clusters can be learned efficiently and describes four new algorithmic results for learning such structure in graphs and hypergraphs. All of the presented theoretical results are extensively evaluated on both synthetic and real-world datasets of different domains, including image classification and segmentation, migration networks, co-authorship networks, and natural language processing. These experimental results demonstrate that the newly developed algorithms are practical, effective, and immediately applicable for learning the structure of clusters in real-world data.
In this paper, we propose a natural notion of individual preference (IP) stability for clustering, which asks that every data point is, on average, closer to the points in its own cluster than to the points in any other cluster. Our notion can be motivated from several perspectives, including game theory and algorithmic fairness. We study several questions related to our proposed notion. We first show that deciding whether a given dataset admits an IP-stable clustering is in general NP-hard. As a result, we explore the design of efficient algorithms for finding IP-stable clusterings in some restricted metric spaces. We present a poly-time algorithm to find a clustering satisfying exact IP-stability on the real line, and an efficient algorithm to find an IP-stable 2-clustering for tree metrics. We also consider relaxing the stability constraint, i.e., every data point should not be too far from its own cluster compared to any other cluster. In this setting, we provide poly-time algorithms with different guarantees. We evaluate some of our algorithms and several standard clustering approaches on real datasets.
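The exact IP-stability condition is easy to state operationally. A small checker, assuming Euclidean points and integer cluster labels (singleton clusters are treated as vacuously stable here):

```python
import numpy as np

def is_ip_stable(points, labels):
    """Check IP-stability: every point's mean distance to its own cluster
    (excluding itself) is at most its mean distance to every other cluster."""
    D = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    for i in range(len(points)):
        own = labels == labels[i]
        own[i] = False                       # exclude the point itself
        if not own.any():
            continue                         # singleton cluster: vacuously stable
        own_mean = D[i, own].mean()
        for c in np.unique(labels):
            if c != labels[i] and own_mean > D[i, labels == c].mean():
                return False
    return True
```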
$k$-means and $k$-median clustering are powerful unsupervised machine learning techniques. However, due to complicated dependences on all the features, it is challenging to interpret the resulting cluster assignments. Moshkovitz, Dasgupta, Rashtchian, and Frost [ICML 2020] proposed an elegant model of explainable $k$-means and $k$-median clustering. In this model, a decision tree with $k$ leaves provides a straightforward characterization of the data set into clusters. We study two natural algorithmic questions about explainable clustering. (1) For a given clustering, how to find the "best explanation" by using a decision tree with $k$ leaves? (2) For a given set of points, how to find an explainable clustering with a decision tree with $k$ leaves minimizing the $k$-means/median objective? To address the first question, we introduce a new model of explainable clustering. Our model, inspired by the notion of outliers in robust statistics, is the following: we seek a small number of points (outliers) whose removal makes the existing clustering well-explainable. To address the second question, we initiate the study of the model of Moshkovitz et al. from the perspective of multivariate complexity. Our rigorous algorithmic analysis reveals the influence of parameters such as the input size, the dimension of the data, the number of outliers, the number of clusters, and the approximation ratio, on the computational complexity of explainable clustering.
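To make the decision-tree model tangible, here is a compact sketch of a threshold tree in the spirit of the Moshkovitz et al. model, simplified (mistakes are recounted at every node, and cluster centers are assumed distinct); it is not the algorithm analyzed in the paper:

```python
import numpy as np

def build_threshold_tree(X, centers):
    """Recursively pick the axis-aligned cut (feature, threshold) that splits
    the remaining centers while routing the fewest points away from their
    closest center, until each leaf holds exactly one center."""
    assign = np.argmin(((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1), axis=1)

    def recurse(pts, cts):
        if len(cts) == 1:
            return {"leaf": int(cts[0])}
        best = None
        for f in range(X.shape[1]):
            for theta in centers[cts, f]:        # thresholds at center coordinates
                left_c = cts[centers[cts, f] <= theta]
                right_c = cts[centers[cts, f] > theta]
                if len(left_c) == 0 or len(right_c) == 0:
                    continue
                # a "mistake" is a point routed to the side opposite its center
                goes_left = X[pts, f] <= theta
                center_left = np.isin(assign[pts], left_c)
                mistakes = int((goes_left != center_left).sum())
                if best is None or mistakes < best[0]:
                    best = (mistakes, f, theta, left_c, right_c)
        _, f, theta, left_c, right_c = best
        return {"feature": int(f), "threshold": float(theta),
                "left": recurse(pts[X[pts, f] <= theta], left_c),
                "right": recurse(pts[X[pts, f] > theta], right_c)}

    return recurse(np.arange(len(X)), np.arange(len(centers)))
```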
Originally, tangles were invented as an abstract tool in mathematical graph theory to prove the famous graph minor theorem. In this paper, we showcase the practical potential of tangles in machine learning applications. Given a collection of cuts of any dataset, tangles aggregate these cuts to point in the direction of a dense structure. As a result, a cluster is softly characterized by a set of consistent pointers. This highly flexible approach can solve clustering problems in various setups, ranging from questionnaires, through community detection in graphs, to clustering points in metric spaces. The output of our proposed framework is hierarchical and induces the notion of a soft dendrogram, which can help explore the cluster structure of a dataset. The computational complexity of aggregating the cuts is linear in the number of data points. Thus the bottleneck of the tangle approach is to generate the cuts, for which simple and fast algorithms form a sufficient basis. In our paper we construct the algorithmic framework for clustering with tangles, prove theoretical guarantees in various settings, and provide extensive simulations and use cases. Python code is available on GitHub.
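The consistency condition at the heart of tangles is simple to state. A sketch of a checker, assuming each cut's chosen side is given as a boolean mask and using a hypothetical `agreement` parameter:

```python
import itertools

def is_consistent_orientation(sides, agreement=1):
    """Tangle-style consistency check: `sides` holds one chosen side per cut
    as a boolean numpy mask; the orientation is consistent if every three
    sides share at least `agreement` data points."""
    for a, b, c in itertools.combinations(sides, 3):
        if (a & b & c).sum() < agreement:
            return False
    return True
```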
Kernel survival analysis models estimate individual survival distributions with the help of a kernel function, which measures the similarity between any two data points. In this paper, we propose a new deep kernel survival model called a survival kernet, which scales to large datasets in a manner that is amenable to model interpretation and theoretical analysis. Specifically, the training data are partitioned into clusters based on a recently developed training set compression scheme for classification and regression called kernel netting, which we extend to the survival analysis setting. At test time, each data point is represented as a weighted combination of these clusters, and each such cluster can be visualized. For a special case of survival kernets, we establish a finite-sample error bound on predicted survival distributions that is, up to a log factor, optimal. Whereas scalability at test time is achieved with the aforementioned kernel netting compression strategy, scalability during training is achieved by a warm-start procedure based on tree ensembles such as XGBoost and a heuristic approach to accelerating neural architecture search. On three standard survival analysis datasets of varying sizes (up to roughly 3 million data points), we show that survival kernets are highly competitive with the best of the baselines tested in terms of concordance index. Our code is available at: https://github.com/georgehc/survival-kernets
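As a rough illustration of the test-time behaviour described above (a sketch under assumed data structures, not the survival kernets implementation), one can weight per-cluster event data by embedding similarity and form a weighted Kaplan-Meier curve:

```python
import numpy as np

def kernet_style_survival(x, reps, times_by_cluster, events_by_cluster, temp=1.0):
    """Represent a test embedding x as a softmax-weighted mix of cluster
    representatives `reps`, then compute a weighted Kaplan-Meier curve over
    the pooled (time, event) pairs. All names here are illustrative."""
    w = np.exp(-np.linalg.norm(reps - x, axis=1) / temp)
    w /= w.sum()
    t = np.concatenate(times_by_cluster)
    e = np.concatenate(events_by_cluster)
    pw = np.concatenate([np.full(len(tc), wc)        # each observation inherits
                         for tc, wc in zip(times_by_cluster, w)])  # its cluster weight
    order = np.argsort(t)
    t, e, pw = t[order], e[order], pw[order]
    at_risk = pw[::-1].cumsum()[::-1]                # weighted risk set at each time
    factors = np.where(e == 1, 1.0 - pw / at_risk, 1.0)
    return t, np.cumprod(factors)                    # event times and S(t)
```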
Multi-label classification is becoming increasingly ubiquitous, but not much attention has been paid to interpretability. In this paper, we develop a multi-label classifier that can be represented as a concise set of simple "if-then" rules, and thus, it offers better interpretability compared to black-box models. Notably, our method is able to find a small set of relevant patterns that lead to accurate multi-label classification, while existing rule-based classifiers are myopic and wasteful in searching rules, requiring a large number of rules to achieve high accuracy. In particular, we formulate the problem of choosing multi-label rules to maximize a target function, which considers not only discrimination ability with respect to labels, but also diversity. Accounting for diversity helps to avoid redundancy, and thus, to control the number of rules in the solution set. To tackle the said maximization problem we propose a 2-approximation algorithm, which relies on a novel technique to sample high-quality rules. In addition to our theoretical analysis, we provide a thorough experimental evaluation, which indicates that our approach offers a trade-off between predictive performance and interpretability that is unmatched in previous work.
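A minimal sketch of the quality-versus-diversity trade-off in rule selection (the paper's actual method is a 2-approximation based on sampling high-quality rules; the rule format here is hypothetical):

```python
def select_rules(candidates, k, lam=0.5):
    """Greedy sketch: each candidate is a pair (covered_ids: set, quality:
    float); redundant coverage is penalized to promote diversity."""
    chosen, covered = [], set()
    for _ in range(k):
        def marginal(rule):
            cov, quality = rule
            overlap = len(cov & covered) / max(len(cov), 1)
            return quality - lam * overlap       # penalize covering the same examples
        pool = [r for r in candidates if r not in chosen]
        if not pool:
            break
        best = max(pool, key=marginal)
        chosen.append(best)
        covered |= best[0]
    return chosen
```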
We review clustering as an analysis tool and the underlying concepts from an introductory perspective. What is clustering and how can clusterings be realised programmatically? How can data be represented and prepared for a clustering task? And how can clustering results be validated? Connectivity-based versus prototype-based approaches are reflected in the context of several popular methods: single-linkage, spectral embedding, k-means, and Gaussian mixtures are discussed as well as the density-based protocols (H)DBSCAN, Jarvis-Patrick, CommonNN, and density-peaks.
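The contrast between these families is easy to reproduce. A short sketch using scikit-learn's implementations on the classic two-moons data (parameter values are illustrative):

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN
from sklearn.mixture import GaussianMixture

X, _ = make_moons(n_samples=500, noise=0.05, random_state=0)
results = {
    "k-means (prototype-based)": KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X),
    "single-linkage (connectivity-based)": AgglomerativeClustering(n_clusters=2, linkage="single").fit_predict(X),
    "DBSCAN (density-based)": DBSCAN(eps=0.2, min_samples=5).fit_predict(X),
    "Gaussian mixture": GaussianMixture(n_components=2, random_state=0).fit_predict(X),
}
for name, y in results.items():
    print(name, "-> cluster sizes:", np.bincount(y[y >= 0]))  # -1 = DBSCAN noise
```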
A fundamental procedure in the analysis of massive datasets is the construction of similarity graphs. Such graphs play a key role for many downstream tasks, including clustering, classification, graph learning, and nearest neighbor search. For these tasks, it is critical to build graphs which are sparse yet still representative of the underlying data. The benefits of sparsity are twofold: firstly, constructing dense graphs is infeasible in practice for large datasets, and secondly, the runtime of downstream tasks is directly influenced by the sparsity of the similarity graph. In this work, we present $\textit{Stars}$: a highly scalable method for building extremely sparse graphs via two-hop spanners, which are graphs where similar points are connected by a path of length at most two. Stars can construct two-hop spanners with significantly fewer similarity comparisons, which are a major bottleneck for learning-based models where comparisons are expensive to evaluate. Theoretically, we demonstrate that Stars builds a graph in nearly-linear time, where approximate nearest neighbors are contained within two-hop neighborhoods. In practice, we have deployed Stars for multiple data sets allowing for graph building at the $\textit{Tera-Scale}$, i.e., for graphs with tens of trillions of edges. We evaluate the performance of Stars for clustering and graph learning, and demonstrate 10~1000-fold improvements in pairwise similarity comparisons compared to different baselines, and 2~10-fold improvement in running time without quality loss.
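A toy illustration of the two-hop idea (not the Stars construction itself, which avoids the brute-force distance computations used here): route every point to its nearest "hub", so that similar points sharing a hub are connected by a path of length two:

```python
import numpy as np

def two_hop_spanner(X, num_hubs, seed=0):
    """Connect each point to its nearest hub from a random sample; points
    with a common hub are linked by a two-hop path through it, and the graph
    has O(n) edges instead of O(n^2)."""
    rng = np.random.default_rng(seed)
    hubs = rng.choice(len(X), size=num_hubs, replace=False)
    d = np.linalg.norm(X[:, None, :] - X[hubs][None, :, :], axis=-1)
    nearest = hubs[np.argmin(d, axis=1)]
    return {(i, int(h)) for i, h in enumerate(nearest) if i != h}
```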
Combinatorial optimization is a well-established area in operations research and computer science. Until recently, its methods had focused on solving problem instances in isolation, ignoring that they often stem from related data distributions in practice. However, recent years have seen a surge of interest in using machine learning, and especially graph neural networks (GNNs), as a key building block for combinatorial tasks, either directly as solvers or by enhancing exact solvers. The inductive bias of GNNs effectively encodes combinatorial and relational input due to their invariance to permutations and their awareness of input sparsity. This paper presents a conceptual review of recent key advancements in this emerging field, aimed at both optimization and machine learning researchers.
We revisit the problem of fair clustering, first introduced by Chierichetti et al., which requires each protected attribute to have approximately equal representation in every cluster, i.e., the balance property. Existing solutions to fair clustering are either not scalable or do not achieve an optimal trade-off between the clustering objective and fairness. In this paper, we propose a new notion of fairness, which we call $\tau$-fair fairness, that strictly generalizes the balance property and enables a good efficiency-vs-fairness trade-off. Furthermore, we show that a simple greedy round-robin-based algorithm achieves this trade-off efficiently. Under a more general setting of multi-valued protected attributes, we rigorously analyze the theoretical properties of our algorithm. Our experimental results suggest that the proposed solution outperforms all the state-of-the-art algorithms and works exceptionally well even for a large number of clusters.
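The balance property being generalized can be computed directly. A small sketch, assuming per-point cluster labels and protected-group labels:

```python
import numpy as np

def balance(labels, groups):
    """Balance in the sense of Chierichetti et al.: for each cluster, the
    min/max ratio over protected-group counts; the clustering's balance is
    the minimum over clusters (1.0 means perfectly equal representation)."""
    vals = []
    for c in np.unique(labels):
        counts = np.array([np.sum((labels == c) & (groups == g))
                           for g in np.unique(groups)])
        if (counts == 0).any():
            return 0.0                   # a group is absent from some cluster
        vals.append(counts.min() / counts.max())
    return float(min(vals))
```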
We initiate a comprehensive experimental study of objective-based hierarchical clustering methods on massive datasets consisting of deep embedding vectors from computer vision and NLP applications. This includes a large variety of image embeddings (ImageNet, ImageNetV2, NaBirds), word embeddings (Twitter, Wikipedia), and sentence embeddings (SST-2) from several popular recent models (e.g., ResNet, ResNeXt, Inception V3, SBERT). Our study includes datasets with up to $4.5$ million entries and embedding dimension up to $2048$. To address the challenge of scaling hierarchical clustering to such large datasets, we propose a new practical hierarchical clustering algorithm B++&C. It achieves a 5%/20% improvement on average for the popular Moseley-Wang (MW) / Cohen-Addad et al. (CKMM) objectives (normalized), compared to a wide range of classic methods and recent heuristics. We also introduce a theoretical algorithm B2SAT&C, which achieves a $0.74$-approximation for the CKMM objective in polynomial time. This is the first substantial improvement over the trivial $2/3$-approximation achieved by a random binary tree. Prior to this work, the best poly-time approximation of $\approx 2/3 + 0.0004$ was due to Charikar et al. (SODA'19).
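For concreteness, the two objectives can be evaluated on a candidate tree as follows (a sketch assuming the tree is given as nested tuples over leaf ids; MW is a reward over similarities, CKMM over dissimilarities):

```python
import itertools

def hc_objective(tree, w, kind="mw"):
    """Evaluate a hierarchy (nested tuples over leaf ids 0..n-1) against an
    n x n weight matrix. kind="mw": Moseley-Wang reward w[i][j]*(n - |lca
    subtree|); kind="ckmm": Cohen-Addad et al. reward w[i][j]*|lca subtree|."""
    n = len(w)
    total = 0.0

    def walk(t):
        nonlocal total
        if isinstance(t, int):
            return [t]
        children = [walk(c) for c in t]
        merged = [x for c in children for x in c]
        m = len(merged)
        # pairs whose lowest common ancestor is this internal node
        for a, b in itertools.combinations(children, 2):
            for i in a:
                for j in b:
                    total += w[i][j] * ((n - m) if kind == "mw" else m)
        return merged

    walk(tree)
    return total

# e.g. hc_objective(((0, 1), (2, 3)), sim_matrix, "mw")
```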
Deep active learning aims to reduce the annotation cost for the training of deep models, which is notoriously data-hungry. Until recently, deep active learning methods were ineffectual in the low-budget regime, where only a small number of examples are annotated. The situation has been alleviated by recent advances in representation and self-supervised learning, which impart the geometry of the data representation with rich information about the points. Taking advantage of this progress, we study the problem of subset selection for annotation through a "covering" lens, proposing ProbCover - a new active learning algorithm for the low budget regime, which seeks to maximize Probability Coverage. We then describe a dual way to view the proposed formulation, from which one can derive strategies suitable for the high budget regime of active learning, related to existing methods like Coreset. We conclude with extensive experiments, evaluating ProbCover in the low-budget regime. We show that our principled active learning strategy improves the state-of-the-art in the low-budget regime in several image recognition benchmarks. This method is especially beneficial in the semi-supervised setting, allowing state-of-the-art semi-supervised methods to match the performance of fully supervised methods, while using far fewer labels. Code is available at https://github.com/avihu111/TypiClust.
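The coverage-maximizing selection can be sketched as a classic greedy max-coverage loop over radius-$r$ balls; this mirrors the spirit of ProbCover but is an illustrative reimplementation with a hypothetical radius parameter:

```python
import numpy as np

def probcover_select(X, budget, r):
    """Repeatedly pick the point whose radius-r ball covers the most
    still-uncovered points in the (embedded) dataset X."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    ball = d <= r                          # ball[i, j]: picking i covers j
    covered = np.zeros(len(X), dtype=bool)
    picked = []
    for _ in range(budget):
        gains = (ball & ~covered[None, :]).sum(axis=1)
        best = int(np.argmax(gains))
        if gains[best] == 0:
            break                          # everything within reach is covered
        picked.append(best)
        covered |= ball[best]
    return picked
```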
Anomaly and outlier detection is a long-standing problem in machine learning. In some cases, anomaly detection is easy, such as when data are drawn from well-characterized distributions such as the Gaussian. However, when data occupy high-dimensional spaces, anomaly detection becomes much more difficult. We present CLAM (Clustered Learning of Approximate Manifolds), a manifold-mapping technique in any metric space. CLAM begins with a fast hierarchical clustering technique and then induces a graph from the cluster tree, based on overlapping clusters as selected using several geometric and topological features. Using these graphs, we implement CHAODA (Clustered Hierarchical Anomaly and Outlier Detection Algorithms), exploring various properties of the graphs and their constituent clusters to find outliers. CHAODA employs a form of transfer learning based on a training set of datasets, and applies that knowledge to a separate test set of datasets of different cardinalities, dimensionalities, and domains. On 24 publicly available datasets, we compare CHAODA (by measure of ROC AUC) to a variety of state-of-the-art unsupervised anomaly-detection algorithms. Six datasets are used for training. CHAODA outperforms the other approaches on 16 of the remaining 18 datasets. CLAM and CHAODA scale to large, high-dimensional "big data" anomaly-detection problems, and generalize across datasets and distance functions. The source code for CLAM and CHAODA is freely available on GitHub at https://github.com/uri-abd/clam.
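A toy sketch of the graph-induction step (cluster selection and CHAODA's scoring are omitted; clusters are assumed given as (center index, radius) pairs, with balls considered overlapping when the center distance is at most the sum of radii):

```python
import itertools
import numpy as np

def induce_graph(clusters, X):
    """Connect two selected clusters whenever their metric balls overlap."""
    edges = set()
    for (i, (ci, ri)), (j, (cj, rj)) in itertools.combinations(enumerate(clusters), 2):
        if np.linalg.norm(X[ci] - X[cj]) <= ri + rj:
            edges.add((i, j))
    return edges
```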
The design of good heuristics or approximation algorithms for NP-hard combinatorial optimization problems often requires significant specialized knowledge and trial-and-error. Can we automate this challenging, tedious process, and learn the algorithms instead? In many real-world applications, it is typically the case that the same optimization problem is solved again and again on a regular basis, maintaining the same problem structure but differing in the data. This provides an opportunity for learning heuristic algorithms that exploit the structure of such recurring problems. In this paper, we propose a unique combination of reinforcement learning and graph embedding to address this challenge. The learned greedy policy behaves like a meta-algorithm that incrementally constructs a solution, and the action is determined by the output of a graph embedding network capturing the current state of the solution. We show that our framework can be applied to a diverse range of optimization problems over graphs, and learns effective algorithms for the Minimum Vertex Cover, Maximum Cut and Traveling Salesman problems.
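The learned greedy policy reduces to a simple loop once the value network is abstracted away. In this sketch for Minimum Vertex Cover, `q_fn` is a stand-in for the trained graph-embedding network (any scoring function works as a stub):

```python
def greedy_construct(graph_nodes, edges, q_fn):
    """Meta-algorithm sketch: repeatedly add the node the value function
    scores highest, until every edge is covered."""
    solution, remaining = [], set(edges)
    while remaining:
        candidates = [v for v in graph_nodes if v not in solution]
        v = max(candidates, key=lambda u: q_fn(solution, u))
        solution.append(v)
        remaining = {e for e in remaining if v not in e}
    return solution

# e.g. greedy_construct(range(4), [(0, 1), (1, 2), (2, 3)], lambda s, u: -u)
```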
When working with user data, it is crucial to provide well-defined privacy guarantees. In this work, we aim to privately manipulate and share an entire sparse dataset with a third party. In fact, differential privacy has emerged as the gold standard for privacy; however, when it comes to sparse datasets, as one of our main results we prove that \emph{any} differentially private mechanism that preserves a reasonable similarity with the initial dataset is doomed to have very weak privacy guarantees. Hence we need to opt for other notions of privacy, such as $k$-anonymity, that better preserve utility in this context. In this work, we introduce a variant of $k$-anonymity, which we call smooth $k$-anonymity, and design simple algorithms that efficiently provide smooth $k$-anonymity. We further perform an empirical evaluation to support our theoretical guarantees, and show that our algorithms improve the performance of downstream machine learning tasks on anonymized data.
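The abstract does not define the smooth variant, but plain $k$-anonymity, the notion being relaxed here, is easy to state as a check over quasi-identifier tuples:

```python
from collections import Counter

def is_k_anonymous(records, k):
    """Plain k-anonymity (not the paper's smooth variant): every
    quasi-identifier tuple must occur at least k times in the dataset."""
    counts = Counter(tuple(r) for r in records)
    return all(c >= k for c in counts.values())
```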
As machine learning becomes prevalent, mitigating any unfairness present in the training data becomes critical. Among the various notions of fairness, this paper focuses on the well-known individual fairness, which states that similar individuals should be treated similarly. While individual fairness can be improved when training a model (in-processing), we contend that fixing the data before model training (pre-processing) is a more fundamental solution. In particular, we show that label flipping is an effective pre-processing technique for improving individual fairness. Our system iFlipper solves the optimization problem of minimally flipping labels given a limit on the number of individual fairness violations, where a violation occurs when two similar examples in the training data have different labels. We first prove that the problem is NP-hard. We then propose an approximate linear programming algorithm and provide theoretical guarantees on how close its result is to the optimal solution in terms of the number of label flips. We also propose techniques for making the linear programming solution more optimal without exceeding the violation limit. Experiments on real datasets show that iFlipper significantly outperforms other pre-processing baselines in terms of individual fairness and accuracy on unseen test sets. In addition, iFlipper can be combined with in-processing techniques for even better results.
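A hedged heuristic illustration of the underlying optimization problem (not iFlipper's LP-based algorithm): greedily flip the binary label whose flip removes the most violations, stopping at the limit or when no single flip helps:

```python
import numpy as np

def greedy_flip(labels, similar_pairs, max_violations):
    """Reduce violations (similar pairs with different labels) by flipping
    0/1 labels one at a time; terminates because each flip strictly helps."""
    y = np.array(labels)
    neighbors = {}
    for i, j in similar_pairs:
        neighbors.setdefault(i, []).append(j)
        neighbors.setdefault(j, []).append(i)
    num_viol = sum(y[i] != y[j] for i, j in similar_pairs)
    while num_viol > max_violations:
        # net gain of flipping i = violating pairs removed - new ones created
        gains = {i: sum(1 if y[i] != y[j] else -1 for j in nbrs)
                 for i, nbrs in neighbors.items()}
        best = max(gains, key=gains.get)
        if gains[best] <= 0:
            break                        # no single flip reduces violations
        y[best] = 1 - y[best]
        num_viol -= gains[best]
    return y, num_viol
```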
In the Priority $k$-Center problem, the input consists of a metric space $(X,d)$, an integer $k$, and for each point $v \in X$ a priority radius $r(v)$. The goal is to choose $k$ centers $S \subseteq X$ to minimize $\max_{v \in X} \frac{1}{r(v)} d(v,S)$. If all $r(v)$'s are uniform, one obtains the $k$-Center problem. Plesn\'ik [Plesn\'ik, Disc. Appl. Math. 1987] introduced the Priority $k$-Center problem and gave a $2$-approximation algorithm matching the best possible algorithm for $k$-Center. We show how the problem is related to two different notions of fair clustering [Harris et al., NeurIPS 2018; Jung et al., FORC 2020]. Motivated by these developments we revisit the problem and, in our main technical contribution, develop a framework that yields constant factor approximation algorithms for Priority $k$-Center with outliers. Our framework extends to generalizations of Priority $k$-Center to matroid and knapsack constraints, and as a corollary, also yields algorithms with fairness guarantees in the lottery model of Harris et al. [Harris et al., JMLR 2019].
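The objective, and a simple greedy heuristic in the spirit of Plesn\'ik's algorithm (a sketch, not a verified reimplementation of the 2-approximation), assuming a precomputed distance matrix `D` and radius vector `r`:

```python
import numpy as np

def priority_objective(D, S, r):
    """max_v d(v, S) / r(v) for a non-empty list of centers S."""
    return float((D[:, S].min(axis=1) / r).max())

def priority_greedy(D, r, k):
    """Scan points by increasing radius; open a center at any point not yet
    served within twice its own radius (the threshold is a heuristic choice)."""
    S = []
    for v in np.argsort(r):
        if len(S) < k and (not S or D[v, S].min() > 2 * r[v]):
            S.append(int(v))
    return S
```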
Diversity maximization is a fundamental problem with wide applications in data summarization, web search, and recommender systems. Given a set $X$ of $n$ elements, it asks to select a subset $S$ of $k \ll n$ elements with maximum \emph{diversity}, as quantified by the dissimilarities among the elements in $S$. In this paper, we focus on the diversity maximization problem with fairness constraints in the streaming setting. Specifically, we consider the max-min diversity objective, which selects a subset $S$ that maximizes the minimum distance (dissimilarity) between any pair of distinct elements within it. Assuming that the set $X$ is partitioned into $m$ disjoint groups by some sensitive attribute, e.g., sex or race, ensuring \emph{fairness} requires that the selected subset $S$ contains $k_i$ elements from each group $i \in [1,m]$. A streaming algorithm should process $X$ sequentially in one pass and return a subset with maximum \emph{diversity} while guaranteeing the fairness constraint. Although diversity maximization has been extensively studied, the only known algorithms that can work with the max-min diversity objective under fairness constraints are very inefficient for data streams. Since diversity maximization is generally NP-hard, we propose two approximation algorithms for fair diversity maximization in data streams, the first of which is $\frac{1-\varepsilon}{4}$-approximate for $m=2$, where $\varepsilon \in (0,1)$, and the second of which achieves a $\frac{1-\varepsilon}{3m+2}$-approximation for an arbitrary $m$. Experimental results on real-world and synthetic datasets show that both algorithms provide solutions of comparable quality to the state-of-the-art algorithms while running several orders of magnitude faster in the streaming setting.
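An offline Gonzalez-style sketch of fair max-min selection (the paper's contribution is the streaming setting, which this deliberately ignores): greedily add the farthest remaining element, restricted to groups whose quota $k_i$ is not yet filled:

```python
import numpy as np

def fair_gmm(X, groups, quotas):
    """Farthest-point heuristic with per-group quotas: `quotas` maps each
    group label to how many elements must be selected from it."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    remaining = dict(quotas)                 # group -> slots still open
    S = []
    while any(v > 0 for v in remaining.values()):
        allowed = [i for i in range(len(X))
                   if i not in S and remaining.get(groups[i], 0) > 0]
        if not allowed:
            break                            # a group ran out of elements
        pick = allowed[0] if not S else max(allowed, key=lambda i: D[i, S].min())
        S.append(pick)
        remaining[groups[pick]] -= 1
    return S
```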