Focusing on Stable Roommates (SR) instances, we contribute to the toolbox for conducting experiments on the stable matching problem. We introduce a polynomial-time computable pseudometric to measure the similarity of SR instances, analyze its properties, and use it to create a map of SR instances. The map visualizes 460 synthetic SR instances (each sampled from one of a number of statistical cultures) as follows: every instance is a point in the plane, and two points are close on the map if the corresponding SR instances are similar to each other. Subsequently, we conduct several exemplary experiments and depict their results on the map, illustrating the usefulness of the map as a non-aggregate visualization tool, the diversity of the generated dataset, and the advantages of using instances sampled from different statistical cultures. Finally, to demonstrate that our framework can also be used for other matching problems under preferences, we create and analyze a map of Stable Marriage instances.
For centuries, it has been widely believed that the influence of a small coalition of voters is negligible in a large election. Consequently, there is a large body of literature on characterizing the asymptotic likelihood for an election to be influenced, especially by the manipulation of a single voter, establishing an $O(\frac{1}{\sqrt n})$ upper bound and an $\Omega(\frac{1}{n^{67}})$ lower bound for many commonly studied voting rules under the i.i.d.~uniform distribution, known as Impartial Culture (IC) in social choice, where $n$ is the number of voters. In this paper, we extend previous studies in three aspects: (1) we consider a more general and realistic semi-random model, where a distribution adversary chooses a worst-case distribution and then a data adversary modifies up to $\psi$ portion of the data, (2) we consider many coalitional influence problems, including coalitional manipulation, margin of victory, and various vote controls and bribery, and (3) we consider arbitrary and variable coalition size $B$. Our main theorem provides asymptotically tight bounds on the semi-random likelihood of the existence of a size-$B$ coalition that can successfully influence the election under a wide range of voting rules. Applications of the main theorem and its proof techniques resolve long-standing open questions about the likelihood of coalitional manipulability under IC, by showing that the likelihood is $\Theta\left(\min\left\{\frac{B}{\sqrt n}, 1\right\}\right)$ for many commonly studied voting rules. The main technical contribution is a characterization of the semi-random likelihood for a Poisson multinomial variable (PMV) to be unstable, which we believe to be a general and useful technique with independent interest.
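As a concrete illustration of the objects involved, the following sketch estimates, by Monte Carlo simulation under Impartial Culture, how often a size-$B$ coalition can flip a plurality outcome. The success condition used here (the gap between the top two tallies is at most $2B$) is a deliberate simplification that ignores tie-breaking; it is an illustrative assumption, not the paper's model.

```python
import random
from collections import Counter

def coalition_can_change_winner(votes, B):
    """Simplified check: B of the leader's supporters can flip the plurality
    outcome to the runner-up iff moving k <= B ballots closes the gap
    (each moved ballot lowers the leader by 1 and raises the rival by 1).
    Tie-breaking subtleties are ignored."""
    tally = Counter(votes)
    ranked = sorted(tally.values(), reverse=True)
    top = ranked[0]
    second = ranked[1] if len(ranked) > 1 else 0
    return top - second <= 2 * B

def estimate_influence_probability(n, m, B, trials=2000, seed=0):
    """Monte Carlo estimate, under Impartial Culture (i.i.d. uniform ballots
    over m alternatives), of the probability that a size-B coalition can
    change the plurality winner."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        votes = [rng.randrange(m) for _ in range(n)]
        if coalition_can_change_winner(votes, B):
            hits += 1
    return hits / trials
```

Plotting the estimate against growing $n$ for fixed $B$ is a quick way to eyeball the $\Theta(\min\{B/\sqrt{n}, 1\})$ behavior; the function names and parameters above are illustrative, not from the paper.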
A recent report by Littmann [Commun. ACM '21] outlines the existence and the dire consequences of collusion rings in academic peer reviewing. We introduce and analyze the problem Cycle-Free Reviewing, which asks for a review assignment without the following kind of collusion ring: a sequence of reviewers where each reviewer reviews a paper authored by the next reviewer in the sequence (with the last reviewer reviewing a paper of the first), thereby creating a reviewing cycle in which every reviewer can deliver a favorable review. As a consequence, all papers in such a cycle have a strong chance of acceptance independent of their scientific merit. We observe that review assignments computed by standard linear-programming approaches typically admit many short reviewing cycles. On the negative side, we show that Cycle-Free Reviewing is NP-hard in various restricted cases (e.g., when every author is qualified to review all papers and one wants to prevent authors from reviewing each other's or their own papers, or when every author writes only one paper and is qualified to review only a few papers). On the positive side, among other things, we show that in some realistic settings, assignments without any short reviewing cycles always exist. This result also gives rise to an efficient heuristic for computing (weighted) cycle-free review assignments, which we show to be of excellent quality in practice.
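The collusion structure described above is easy to state in code. The sketch below builds the directed "reviews an author of" relation from a hypothetical assignment and checks for reviewing 2-cycles (two reviewers each reviewing a paper of the other); the data layout, with dicts mapping reviewers to papers and papers to authors, is an assumption for illustration only.

```python
def review_digraph(assignment, authors):
    """Directed 'reviews' relation: edge (r, a) whenever reviewer r is
    assigned some paper written by author a (self-loops excluded)."""
    edges = set()
    for reviewer, papers in assignment.items():
        for p in papers:
            for a in authors[p]:
                if a != reviewer:
                    edges.add((reviewer, a))
    return edges

def has_short_review_cycle(assignment, authors):
    """Detect a reviewing 2-cycle: u reviews a paper of v and v reviews
    a paper of u."""
    edges = review_digraph(assignment, authors)
    return any((v, u) in edges for (u, v) in edges)
```

Longer cycles could be found with a standard DFS over the same digraph; the 2-cycle case shown here is the shortest ring the abstract warns about.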
Many deep learning systems, including recommender systems, learn vector-embedding models of preference spaces. Typically, these models are assumed to approximate a Euclidean structure, in which an individual prefers alternatives that are closer to their "ideal point", as measured by the Euclidean metric. However, Bogomolnaia and Laslier (2007) showed that there exist ordinal preference profiles that cannot be represented by such a structure if the dimension of the Euclidean space is less than the number of individuals or alternatives. We extend this result, showing that there are realistic situations in which almost all preference profiles cannot be represented by a Euclidean model, and derive a theoretical lower bound on the information lost when approximating non-Euclidean preferences with a Euclidean model. Our results have implications for the interpretation and use of vector embeddings, because in some cases approximating arbitrary preferences is possible only when the dimension of the embedding is a substantial fraction of the number of individuals or alternatives.
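A minimal sketch of the Euclidean (ideal-point) model under discussion: each individual ranks alternatives by distance to their own ideal point, closer meaning more preferred. Names and the data layout are illustrative; a profile is "Euclidean representable" exactly when it can be generated this way.

```python
def euclidean_preferences(ideal_points, alternatives):
    """Derive each individual's ordinal ranking of the alternatives from
    squared Euclidean distance to their ideal point (closer = preferred)."""
    def dist2(p, q):
        return sum((pi - qi) ** 2 for pi, qi in zip(p, q))
    return {
        name: sorted(alternatives, key=lambda a: dist2(point, alternatives[a]))
        for name, point in ideal_points.items()
    }
```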
Graph clustering is a fundamental problem in unsupervised learning, with numerous applications in computer science and in analysing real-world data. In many real-world applications, we find that the clusters have a significant high-level structure. This is often overlooked in the design and analysis of graph clustering algorithms, which make strong simplifying assumptions about the structure of the graph. This thesis addresses the natural question of whether the structure of clusters can be learned efficiently and describes four new algorithmic results for learning such structure in graphs and hypergraphs. All of the presented theoretical results are extensively evaluated on both synthetic and real-world datasets of different domains, including image classification and segmentation, migration networks, co-authorship networks, and natural language processing. These experimental results demonstrate that the newly developed algorithms are practical, effective, and immediately applicable for learning the structure of clusters in real-world data.
The stochastic block model (SBM) is a random graph model with planted clusters. It is widely employed as a canonical model to study clustering and community detection, and generally provides a fertile ground to study the statistical and computational tradeoffs that arise in network and data sciences. This note surveys the recent developments that establish the fundamental limits for community detection in the SBM, both with respect to information-theoretic and computational thresholds, and for various recovery requirements such as exact, partial and weak recovery (a.k.a., detection). The main results discussed are the phase transitions for exact recovery at the Chernoff-Hellinger threshold, the phase transition for weak recovery at the Kesten-Stigum threshold, the optimal distortion-SNR tradeoff for partial recovery, the learning of the SBM parameters and the gap between information-theoretic and computational thresholds. The note also covers some of the algorithms developed in the quest of achieving the limits, in particular two-round algorithms via graph-splitting, semi-definite programming, linearized belief propagation, classical and nonbacktracking spectral methods. A few open problems are also discussed.
The stochastic block model (SBM) is a random graph model in which different groups of vertices connect differently. It is widely employed as a canonical model to study clustering and community detection, and provides a fertile ground to study the information-theoretic and computational tradeoffs that arise in combinatorial statistics and, more generally, data science. This monograph surveys the recent developments that establish the fundamental limits for community detection in the SBM, both with respect to information-theoretic and computational thresholds, and for various recovery requirements such as exact, partial and weak recovery. The main results discussed are the phase transition for exact recovery at the Chernoff-Hellinger threshold, the phase transition for weak recovery at the Kesten-Stigum threshold, the optimal SNR-mutual information tradeoff for partial recovery, and the gap between information-theoretic and computational thresholds. The monograph gives a principled derivation of the main algorithms developed in the quest of achieving the limits, in particular via graph-splitting, semidefinite programming, (linearized) belief propagation, classical/nonbacktracking spectral methods, and graph powering. Extensions to other block models, such as geometric block models, and a few open problems are also discussed.
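For concreteness, here is a minimal sampler for the symmetric two-parameter SBM discussed in the two abstracts above: vertices in the same planted cluster connect with probability `p_in`, across clusters with probability `p_out`. This is an illustrative generator, not tied to any particular recovery algorithm from the surveys.

```python
import random

def sample_sbm(sizes, p_in, p_out, seed=0):
    """Sample an undirected graph from a symmetric stochastic block model.
    sizes: list of block sizes; returns (labels, edges) where labels[v] is
    the planted community of vertex v and edges is a set of pairs (u, v)."""
    rng = random.Random(seed)
    labels = []
    for block, size in enumerate(sizes):
        labels.extend([block] * size)
    n = len(labels)
    edges = set()
    for u in range(n):
        for v in range(u + 1, n):
            p = p_in if labels[u] == labels[v] else p_out
            if rng.random() < p:
                edges.add((u, v))
    return labels, edges
```

In the sparse regime studied in the surveys one would take `p_in = a/n` and `p_out = b/n`; weak recovery is then possible, per the Kesten-Stigum threshold, when $(a-b)^2 > 2(a+b)$.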
k-median and k-means are the two most popular objectives for clustering algorithms. Despite intensive effort, a good understanding of the approximability of these objectives, particularly in $\ell_p$-metrics, remains a major open problem. In this paper, we significantly improve upon the hardness-of-approximation factors known in the literature for these objectives in $\ell_p$-metrics. We introduce a new hypothesis called the Johnson Coverage Hypothesis (JCH), which roughly asserts that the well-studied Max k-Coverage problem on set systems is hard to approximate to a factor greater than $1-1/e$, even when the membership graph of the set system is a subgraph of the Johnson graph. We then show that, together with generalizations of the embedding techniques introduced by Cohen-Addad and Karthik (FOCS'19), JCH implies hardness-of-approximation results for k-median and k-means in $\ell_p$-metrics for factors close to the ones obtained for general metrics. In particular, assuming JCH we show that it is hard to approximate the k-means objective: Discrete case: to a factor of 3.94 in the $\ell_1$-metric and a factor of 1.73 in the $\ell_2$-metric; this improves upon the previous factors of 1.56 and 1.17 respectively, obtained under UGC. Continuous case: to a factor of 2.10 in the $\ell_1$-metric and a factor of 1.36 in the $\ell_2$-metric; this improves upon the previous factor of 1.07 in the $\ell_2$-metric obtained under UGC. We also obtain similar improvements under JCH for the k-median objective. Moreover, we prove a weak version of JCH using the work of Dinur et al. (SICOMP'05) on Hypergraph Vertex Cover, and recover all the above results of Cohen-Addad and Karthik (FOCS'19) to (nearly) the same inapproximability factors, but now under the standard NP $\neq$ P assumption (in place of UGC).
Originally, tangles were invented as an abstract tool in mathematical graph theory to prove the famous graph minor theorem. In this paper, we showcase the practical potential of tangles in machine learning applications. Given a collection of cuts of any dataset, tangles aggregate these cuts to point in the direction of a dense structure. As a result, a cluster is softly characterized by a set of consistent pointers. This highly flexible approach can solve clustering problems in various setups, ranging from questionnaires over community detection in graphs to clustering points in metric spaces. The output of our proposed framework is hierarchical and induces the notion of a soft dendrogram, which can help explore the cluster structure of a dataset. The computational complexity of aggregating the cuts is linear in the number of data points. Thus the bottleneck of the tangle approach is to generate the cuts, for which simple and fast algorithms form a sufficient basis. In our paper we construct the algorithmic framework for clustering with tangles, prove theoretical guarantees in various settings, and provide extensive simulations and use cases. Python code is available on github.
Low-rank matrix approximations, such as the truncated singular value decomposition and the rank-revealing QR decomposition, play a central role in data analysis and scientific computing. This work surveys and extends recent research which demonstrates that randomization offers a powerful tool for performing low-rank matrix approximation. These techniques exploit modern computational architectures more fully than classical methods and open the possibility of dealing with truly massive data sets. This paper presents a modular framework for constructing randomized algorithms that compute partial matrix decompositions. These methods use random sampling to identify a subspace that captures most of the action of a matrix. The input matrix is then compressed-either explicitly or implicitly-to this subspace, and the reduced matrix is manipulated deterministically to obtain the desired low-rank factorization. In many cases, this approach beats its classical competitors in terms of accuracy, speed, and robustness. These claims are supported by extensive numerical experiments and a detailed error analysis. The specific benefits of randomized techniques depend on the computational environment. Consider the model problem of finding the k dominant components of the singular value decomposition of an m × n matrix. (i) For a dense input matrix, randomized algorithms require O(mn log(k)) floating-point operations (flops) in contrast with O(mnk) for classical algorithms. (ii) For a sparse input matrix, the flop count matches classical Krylov subspace methods, but the randomized approach is more robust and can easily be reorganized to exploit multi-processor architectures. (iii) For a matrix that is too large to fit in fast memory, the randomized techniques require only a constant number of passes over the data, as opposed to O(k) passes for classical algorithms. In fact, it is sometimes possible to perform matrix approximation with a single pass over the data.
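The modular framework described above (sketch the range with random sampling, compress to that subspace, then decompose deterministically) can be prototyped in a few lines. This is a simplified sketch without power iterations or the structured test matrices behind the O(mn log k) flop count; it uses a plain Gaussian test matrix.

```python
import numpy as np

def randomized_svd(A, k, oversample=10, seed=0):
    """Prototype randomized SVD in the proto-algorithm style of the survey:
    1. sketch Y = A @ G with a Gaussian test matrix G,
    2. orthonormalize Y to get a basis Q for the approximate range of A,
    3. compute the SVD of the small matrix B = Q.T @ A and lift back."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    G = rng.standard_normal((n, k + oversample))
    Q, _ = np.linalg.qr(A @ G)          # orthonormal basis for the sketch
    B = Q.T @ A                          # small compressed matrix
    Uh, s, Vt = np.linalg.svd(B, full_matrices=False)
    U = Q @ Uh                           # lift left factors back to R^m
    return U[:, :k], s[:k], Vt[:k, :]
```

On a matrix of exact rank k the sketch captures the range almost surely, so the reconstruction is exact up to floating-point error; for general matrices the survey's error analysis governs the quality of `Q`.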
In recent years, spectral clustering has become one of the most popular modern clustering algorithms. It is simple to implement, can be solved efficiently by standard linear algebra software, and very often outperforms traditional clustering algorithms such as the k-means algorithm. At first glance spectral clustering appears slightly mysterious, and it is not obvious why it works at all and what it really does. The goal of this tutorial is to give some intuition on those questions. We describe different graph Laplacians and their basic properties, present the most common spectral clustering algorithms, and derive those algorithms from scratch by several different approaches. Advantages and disadvantages of the different spectral clustering algorithms are discussed.
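A bare-bones version of the pipeline the tutorial derives (unnormalized Laplacian, spectral embedding from the smallest eigenvectors, then a clustering step) might look as follows. For simplicity, the final step here is a trivial two-way split on the Fiedler vector rather than k-means on the embedding.

```python
import numpy as np

def unnormalized_laplacian(W):
    """Graph Laplacian L = D - W for a symmetric weight matrix W."""
    return np.diag(W.sum(axis=1)) - W

def spectral_embedding(W, k):
    """Rows of the k eigenvectors of L with smallest eigenvalues give the
    embedded vertices, ready for a clustering step such as k-means."""
    vals, vecs = np.linalg.eigh(unnormalized_laplacian(W))
    return vecs[:, :k]

def two_way_cut(W):
    """Toy 2-clustering: split on the Fiedler vector (eigenvector of the
    second-smallest eigenvalue), thresholding at its median."""
    fiedler = spectral_embedding(W, 2)[:, 1]
    return (fiedler > np.median(fiedler)).astype(int)
```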
We review clustering as an analysis tool and the underlying concepts from an introductory perspective. What is clustering and how can clusterings be realised programmatically? How can data be represented and prepared for a clustering task? And how can clustering results be validated? Connectivity-based versus prototype-based approaches are reflected in the context of several popular methods: single-linkage, spectral embedding, k-means, and Gaussian mixtures are discussed as well as the density-based protocols (H)DBSCAN, Jarvis-Patrick, CommonNN, and density-peaks.
Participatory budgeting (PB) has attracted much attention recently due to its wide applicability in social choice settings. In this paper, we consider indivisible PB, which involves allocating an available, limited budget to a set of indivisible projects, each with a certain cost, based on the preferences of the agents over the projects. The specific, important research gap that we address in this paper is to propose classes of rules for indivisible PB with weak rankings (i.e., weak ordinal preferences) and to study their key algorithmic and axiomatic questions. We propose two classes of rules with distinct significance and motivation. The first is layered approval rules, which enable weak rankings to be studied by carefully translating them into approval votes. The second is need-based rules, which capture fairness concerns. Under layered approval rules, we study two natural families of rules: greedy-truncation rules and cost-worthy rules. The paper has two parts. In the first part, we study the algorithmic and complexity-theoretic issues of the proposed rules. In the second part, we present a detailed axiomatic analysis of these rules, for which we examine and generalize axioms from the literature and introduce a new axiom, promotion. The paper helps to highlight the trade-offs between the practical appeal, computational complexity, and axiomatic compliance of these rules.
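As a hedged illustration of a greedy, approval-based budgeting rule in the spirit of the families above: rank projects by approval count and add each affordable project in turn. The exact rules, the translation from weak rankings to approvals, and the tie-breaking in the paper differ; the ordering below is an illustrative assumption.

```python
def greedy_approval_pb(costs, approvals, budget):
    """Greedy budgeting sketch: score each project by its approval count,
    order by score (ties broken by lower cost, then name), and add each
    project that still fits within the remaining budget."""
    score = {p: 0 for p in costs}
    for ballot in approvals:
        for p in ballot:
            score[p] += 1
    order = sorted(costs, key=lambda p: (-score[p], costs[p], p))
    chosen, spent = [], 0
    for p in order:
        if spent + costs[p] <= budget:
            chosen.append(p)
            spent += costs[p]
    return chosen
```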
Suppose we are given an $n$-dimensional order-3 symmetric tensor $T \in (\mathbb{R}^n)^{\otimes 3}$ that is the sum of $r$ random rank-1 terms. The problem of recovering the rank-1 components is possible in principle when $r \lesssim n^2$ but polynomial-time algorithms are only known in the regime $r \ll n^{3/2}$. Similar "statistical-computational gaps" occur in many high-dimensional inference tasks, and in recent years there has been a flurry of work on explaining the apparent computational hardness in these problems by proving lower bounds against restricted (yet powerful) models of computation such as statistical queries (SQ), sum-of-squares (SoS), and low-degree polynomials (LDP). However, no such prior work exists for tensor decomposition, largely because its hardness does not appear to be explained by a "planted versus null" testing problem. We consider a model for random order-3 tensor decomposition where one component is slightly larger in norm than the rest (to break symmetry), and the components are drawn uniformly from the hypercube. We resolve the computational complexity in the LDP model: $O(\log n)$-degree polynomial functions of the tensor entries can accurately estimate the largest component when $r \ll n^{3/2}$ but fail to do so when $r \gg n^{3/2}$. This provides rigorous evidence suggesting that the best known algorithms for tensor decomposition cannot be improved, at least by known approaches. A natural extension of the result holds for tensors of any fixed order $k \ge 3$, in which case the LDP threshold is $r \sim n^{k/2}$.
We consider the problem of sequential evaluation, in which an evaluator observes candidates in a sequence and assigns scores to them in an online, irrevocable fashion. Motivated by the psychology literature that has studied sequential bias in such settings -- that is, dependencies between the evaluation outcome and the order in which the candidates appear -- we propose a natural model for the evaluator's rating process that captures the lack of calibration inherent to such a task. We conduct crowdsourcing experiments to demonstrate various facets of our model. We then proceed to study how to correct sequential bias under our model by posing it as a statistical inference problem. We propose a near-linear-time, online algorithm for this task and prove guarantees in terms of two canonical ranking metrics. We also prove that our algorithm is information-theoretically optimal, by establishing matching lower bounds in both metrics. Finally, we show that our algorithm outperforms the de facto method of using the rankings induced by the reported scores.
Kernel matrices, as well as weighted graphs represented by them, are ubiquitous objects in machine learning, statistics and other related fields. The main drawback of using kernel methods (learning and inference using kernel matrices) is efficiency -- given $n$ input points, most kernel-based algorithms need to materialize the full $n \times n$ kernel matrix before performing any subsequent computation, thus incurring $\Omega(n^2)$ runtime. Breaking this quadratic barrier for various problems has therefore, been a subject of extensive research efforts. We break the quadratic barrier and obtain $\textit{subquadratic}$ time algorithms for several fundamental linear-algebraic and graph processing primitives, including approximating the top eigenvalue and eigenvector, spectral sparsification, solving linear systems, local clustering, low-rank approximation, arboricity estimation and counting weighted triangles. We build on the recent Kernel Density Estimation framework, which (after preprocessing in time subquadratic in $n$) can return estimates of row/column sums of the kernel matrix. In particular, we develop efficient reductions from $\textit{weighted vertex}$ and $\textit{weighted edge sampling}$ on kernel graphs, $\textit{simulating random walks}$ on kernel graphs, and $\textit{importance sampling}$ on matrices to Kernel Density Estimation and show that we can generate samples from these distributions in $\textit{sublinear}$ (in the support of the distribution) time. Our reductions are the central ingredient in each of our applications and we believe they may be of independent interest. We empirically demonstrate the efficacy of our algorithms on low-rank approximation (LRA) and spectral sparsification, where we observe a $\textbf{9x}$ decrease in the number of kernel evaluations over baselines for LRA and a $\textbf{41x}$ reduction in the graph size for spectral sparsification.
We focus on an online 2-stage problem, motivated by the following situation: consider a system where students shall be assigned to universities. In a first round, some students apply and a first (stable) matching $M_1$ has to be computed. However, some students may then decide to leave the system (change their plans, go to a foreign university, or to some institution not in the system). In a second round (after these deletions), we shall compute a second (final) stable matching $M_2$. As it is undesirable to change assignments, the goal is to minimize the number of divorces/modifications between the two stable matchings $M_1$ and $M_2$. How, then, should we choose $M_1$ and $M_2$? We show that there is an {\it optimal online} algorithm for this problem. In particular, thanks to a dominance property, we show that we can compute $M_1$ optimally without knowing which students will leave the system. We generalize the result to some other possible modifications of the input (students, open positions). We also tackle the case of more stages, showing that no competitive (online) algorithm can be achieved as soon as there are 3 stages.
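For reference, a stable matching such as $M_1$ can be computed with classical deferred acceptance (Gale-Shapley). This is a textbook sketch of that procedure, not the paper's online algorithm or its dominance-based selection between stable matchings.

```python
def gale_shapley(proposer_prefs, receiver_prefs):
    """Deferred acceptance: free proposers propose down their preference
    lists; each receiver holds on to the best offer seen so far.
    Returns the proposer-optimal stable matching as {proposer: receiver}."""
    rank = {r: {p: i for i, p in enumerate(prefs)}
            for r, prefs in receiver_prefs.items()}
    next_choice = {p: 0 for p in proposer_prefs}
    engaged_to = {}                      # receiver -> current proposer
    free = list(proposer_prefs)
    while free:
        p = free.pop()
        r = proposer_prefs[p][next_choice[p]]
        next_choice[p] += 1
        if r not in engaged_to:
            engaged_to[r] = p            # receiver accepts first offer
        elif rank[r][p] < rank[r][engaged_to[r]]:
            free.append(engaged_to[r])   # receiver trades up; old proposer freed
            engaged_to[r] = p
        else:
            free.append(p)               # offer rejected; propose again later
    return {p: r for r, p in engaged_to.items()}
```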
An $n$-queens configuration is a placement of $n$ mutually non-attacking queens on an $n \times n$ chessboard. The $n$-queens completion problem, introduced by Nauck in 1850, is to decide whether a given partial configuration can be completed to an $n$-queens configuration. In this paper, we study an extremal aspect of this problem, namely: how small must a partial configuration be so that a completion is always possible? We show that any placement of at most $n/60$ mutually non-attacking queens can be completed. We also provide partial configurations of roughly $n/4$ queens that cannot be completed, and formulate a number of interesting problems. Our proofs connect the queens problem to rainbow matchings in bipartite graphs and use probabilistic arguments together with linear programming duality.
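The basic feasibility check behind the completion problem, plus a brute-force completion by backtracking, can be sketched as follows. This is practical only for small $n$; the paper's contributions are extremal bounds, not an algorithm.

```python
def non_attacking(queens):
    """True iff no two queens (row, col) share a row, column, or diagonal."""
    rows = {r for r, _ in queens}
    cols = {c for _, c in queens}
    diag1 = {r - c for r, c in queens}
    diag2 = {r + c for r, c in queens}
    k = len(queens)
    return len(rows) == len(cols) == len(diag1) == len(diag2) == k

def complete(partial, n):
    """Try to extend a non-attacking partial placement to a full n-queens
    configuration by backtracking over the unused rows. Returns a full
    placement, or None if no completion exists."""
    if not non_attacking(partial):
        return None
    free_rows = [r for r in range(n) if r not in {q[0] for q in partial}]

    def extend(placed, rows):
        if not rows:
            return placed
        for c in range(n):
            cand = placed + [(rows[0], c)]
            if non_attacking(cand):
                sol = extend(cand, rows[1:])
                if sol is not None:
                    return sol
        return None

    return extend(list(partial), free_rows)
```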
We present improved algorithms and matching statistical and computational lower bounds for the problem of identity testing of $n$-dimensional distributions. In the identity testing problem, we are given as input an explicit distribution $\mu$, an $\varepsilon > 0$, and sample-oracle access to a hidden distribution $\pi$. The goal is to distinguish whether the two distributions $\mu$ and $\pi$ are identical or at least $\varepsilon$-far apart. When given access only to full samples from the hidden distribution $\pi$, it is known that exponentially many samples may be needed, and hence previous works have studied identity testing with additional access to various conditional sampling oracles. We consider here a significantly weaker conditional sampling oracle, called the Coordinate Oracle, and provide a fairly complete computational and statistical characterization of the identity testing problem in this new model. We prove that if an analytic property known as approximate tensorization of entropy holds for the visible distribution $\mu$, then there is an efficient identity testing algorithm for any hidden $\pi$ that uses $\tilde{O}(n/\varepsilon)$ queries to the Coordinate Oracle. Approximate tensorization of entropy is a classical tool for proving optimal mixing-time bounds of Markov chains for high-dimensional distributions, and has recently been established for many families of distributions via spectral independence. We complement our algorithmic result with a matching $\Omega(n/\varepsilon)$ statistical lower bound on the number of queries under the Coordinate Oracle. We also prove a computational phase transition: for the sparse antiferromagnetic Ising model over $\{+1,-1\}^n$, in the regime where approximate tensorization of entropy fails, there is no efficient identity testing algorithm unless RP = NP.
This paper investigates simultaneous preference and metric learning from a crowd of respondents. We are given a set of items represented by $d$-dimensional feature vectors, together with paired comparisons of the form ``item $i$ is preferable to item $j$'' made by each user. Our model jointly learns a distance metric characterizing the crowd's general measure of item similarity, along with a latent ideal point for each user reflecting their individual preferences. This model has the flexibility to capture individual preferences while enjoying a metric-learning sample cost that is amortized over the crowd. We first study this problem in a noiseless, continuous-response setting (i.e., responses equal to differences of item distances) to understand the fundamental limits of learning. Next, we establish prediction-error guarantees for noisy, binary measurements such as those collected from human respondents, and show how the sample complexity improves when the underlying metric is low-rank. Finally, we establish recovery guarantees under assumptions on the response distribution. We demonstrate the performance of our model on both simulated data and on a dataset of color preference judgments from a large number of users.
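A sketch of the forward model: a user's preference between two items is determined by their distances to the user's ideal point under a learned metric $M$. The quadratic-form evaluation below is illustrative only; the paper concerns learning $M$ and the ideal points from comparisons, which this sketch does not do.

```python
def mahalanobis_sq(x, y, M):
    """Squared distance (x - y)^T M (x - y) for a PSD metric matrix M,
    given as a list of rows."""
    d = [xi - yi for xi, yi in zip(x, y)]
    Md = [sum(M[i][j] * d[j] for j in range(len(d))) for i in range(len(d))]
    return sum(di * mdi for di, mdi in zip(d, Md))

def prefers(ideal, item_i, item_j, M):
    """Under the ideal-point model, the user prefers item i to item j iff
    i is strictly closer to their ideal point in the metric M."""
    return mahalanobis_sq(ideal, item_i, M) < mahalanobis_sq(ideal, item_j, M)
```

Note how reweighting a coordinate in $M$ can reverse a preference: the crowd-level metric decides which feature differences matter, while the ideal point encodes the individual.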