Differentially private (DP) data release is a promising technique for disseminating data without compromising the privacy of data subjects. However, most prior work has focused on the scenario where a single party owns all the data. In this paper, we focus on the multi-party setting, in which different stakeholders own sets of attributes belonging to the same group of data subjects. In the context of linear regression, where the goal is to allow all parties to train a model on the complete data without being able to infer private attributes or identities of individuals, we first directly apply the Gaussian mechanism and show that it suffers from a small-eigenvalue problem. We then propose our novel method and prove that it asymptotically converges to the optimal (non-private) solution as the dataset size increases. We substantiate the theoretical results with experiments on both artificial and real-world datasets.
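For context (this is not the paper's multi-party protocol), a minimal single-party sketch of applying the Gaussian mechanism to the sufficient statistics of linear regression might look as follows; the clipping bound B, the rough sensitivity and noise calibration, and the ad-hoc ridge term used to keep the perturbed Gram matrix invertible are all illustrative assumptions.

```python
import numpy as np

def gaussian_mechanism_linear_regression(X, y, epsilon, delta, B=1.0, rng=None):
    """Sketch: perturb the sufficient statistics X^T X and X^T y with Gaussian
    noise, then solve the normal equations. Each row is assumed to satisfy
    ||x_i|| <= B and |y_i| <= B (illustrative clipping assumption)."""
    rng = np.random.default_rng(rng)
    n, d = X.shape
    sensitivity = 2.0 * B ** 2  # rough bound on the per-record change of the statistics
    sigma = sensitivity * np.sqrt(2.0 * np.log(1.25 / delta)) / epsilon
    # Symmetric Gaussian noise for the Gram matrix, plain Gaussian noise for X^T y.
    E = rng.normal(scale=sigma, size=(d, d))
    noisy_gram = X.T @ X + (E + E.T) / np.sqrt(2.0)
    noisy_xty = X.T @ y + rng.normal(scale=sigma, size=d)
    # Small eigenvalues of the noisy Gram matrix make the solve unstable;
    # a ridge term is one ad-hoc fix (the issue the abstract refers to).
    return np.linalg.solve(noisy_gram + 1e-3 * n * np.eye(d), noisy_xty)

# Toy usage on synthetic data.
rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 5)) / np.sqrt(5)
beta = np.arange(1, 6) / 10.0
y = np.clip(X @ beta + 0.1 * rng.normal(size=5000), -1, 1)
print(gaussian_mechanism_linear_regression(X, y, epsilon=1.0, delta=1e-5))
```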
We consider how to privately share the personalized privacy losses incurred by objective perturbation, using per-instance differential privacy (pDP). Standard differential privacy (DP) gives a worst-case bound that may be orders of magnitude larger than the privacy loss to a particular individual relative to a fixed dataset. The pDP framework provides a more fine-grained analysis of the privacy guarantee for a target individual, but the per-instance privacy loss itself may be a function of sensitive data. In this paper, we analyze the per-instance privacy loss of releasing a private empirical risk minimizer via objective perturbation, and propose a set of methods that privately and accurately publish the pDP losses at no additional privacy cost.
In this paper, we study the problem of estimating smooth generalized linear models (GLMs) in the non-interactive local differential privacy (NLDP) model. Unlike the classical setting, our model allows the server to access some additional public but unlabeled data. In the first part of the paper, we focus on GLMs. Specifically, we first consider the case where each data record is i.i.d. sampled from a zero-mean multivariate Gaussian distribution. Motivated by Stein's lemma, we present an $(\epsilon, \delta)$-NLDP algorithm for GLMs. The sample complexities of public and private data for the algorithm to achieve an $\ell_2$-norm estimation error of $\alpha$ (with high probability) are $O(p\alpha^{-2})$ and $\tilde{O}(p^3\alpha^{-2}\epsilon^{-2})$, respectively, where $p$ is the dimension of the feature vector. This is a significant improvement over the previously known sample complexities for GLMs with no public data, which are exponential or quasi-polynomial in $\alpha^{-1}$, or exponential in $p$. We then consider a more general setting where each data record is i.i.d. sampled from some sub-Gaussian distribution with bounded $\ell_1$-norm. Based on a variant of Stein's lemma, we propose an $(\epsilon, \delta)$-NLDP algorithm for GLMs whose sample complexities of public and private data to achieve an $\ell_\infty$-norm estimation error of $\alpha$ are $O(p^2\alpha^{-2})$ and $\tilde{O}(p^2\alpha^{-2}\epsilon^{-2})$, respectively, under mild assumptions and provided $\alpha$ is not too small (i.e., $\alpha \geq \Omega(\frac{1}{\sqrt{p}})$). In the second part of the paper, we extend our ideas to the problem of estimating non-linear regressions and show results similar to those for GLMs in both the multivariate Gaussian and sub-Gaussian cases. Finally, we demonstrate the effectiveness of our algorithms through experiments on both synthetic and real-world datasets.
We study the problem of differentially private linear regression, where each data point is sampled from a fixed sub-Gaussian style distribution. We propose and analyze a one-pass mini-batch stochastic gradient descent method (DP-AMBSSGD), in which points in each iteration are sampled without replacement. Noise is added for DP, but the noise standard deviation is estimated online. In contrast to existing $(\epsilon, \delta)$-DP techniques with sub-optimal error bounds, DP-AMBSSGD provides nearly optimal error bounds in terms of key parameters such as the dimensionality $d$, the number of samples $n$, and the standard deviation $\sigma$ of the noise in the observations. For example, when the $d$-dimensional covariates are sampled from the normal distribution, the excess error of DP-AMBSSGD due to privacy is $\frac{\sigma^2 d}{n}(1 + \frac{d}{\epsilon^2 n})$, i.e., the error is meaningful as soon as the number of samples satisfies $n = \Omega(d \log d)$, which is the standard operative regime for linear regression. In contrast, the error bounds of existing efficient methods in this setting are $\mathcal{O}\big(\frac{d^3}{\epsilon^2 n^2}\big)$, even for $\sigma = 0$. That is, for constant $\epsilon$, existing techniques require $n = \Omega(d\sqrt{d})$ to provide a non-trivial result.
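The paper's DP-AMBSSGD calibrates its noise adaptively; the sketch below is only a generic one-pass noisy mini-batch SGD for least squares, with per-example gradient clipping and a fixed noise multiplier as simplifying assumptions (the function name and all constants are hypothetical).

```python
import numpy as np

def dp_one_pass_sgd(X, y, lr=0.1, batch_size=100, clip=1.0, noise_mult=1.0, rng=None):
    """Sketch of one-pass noisy mini-batch SGD for least squares: each example is
    used once, per-example gradients are clipped to `clip`, and Gaussian noise
    proportional to `noise_mult * clip` is added to each averaged batch gradient."""
    rng = np.random.default_rng(rng)
    n, d = X.shape
    w = np.zeros(d)
    perm = rng.permutation(n)
    for start in range(0, n - batch_size + 1, batch_size):
        idx = perm[start:start + batch_size]
        residual = X[idx] @ w - y[idx]
        grads = residual[:, None] * X[idx]                      # per-example gradients
        norms = np.maximum(np.linalg.norm(grads, axis=1) / clip, 1.0)
        clipped = grads / norms[:, None]
        noise = rng.normal(scale=noise_mult * clip, size=d)
        w -= lr * (clipped.sum(axis=0) + noise) / batch_size
    return w

rng = np.random.default_rng(1)
X = rng.normal(size=(20000, 10))
y = X @ np.linspace(0.1, 1.0, 10) + 0.1 * rng.normal(size=20000)
print(dp_one_pass_sgd(X, y))
```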
In this work, we give efficient algorithms for privately estimating a Gaussian distribution in both pure and approximate differential privacy (DP) models with optimal dependence on the dimension in the sample complexity. In the pure DP setting, we give an efficient algorithm that estimates an unknown $d$-dimensional Gaussian distribution up to an arbitrary tiny total variation error using $\widetilde{O}(d^2 \log \kappa)$ samples while tolerating a constant fraction of adversarial outliers. Here, $\kappa$ is the condition number of the target covariance matrix. The sample bound matches best non-private estimators in the dependence on the dimension (up to a polylogarithmic factor). We prove a new lower bound on differentially private covariance estimation to show that the dependence on the condition number $\kappa$ in the above sample bound is also tight. Prior to our work, only identifiability results (yielding inefficient super-polynomial time algorithms) were known for the problem. In the approximate DP setting, we give an efficient algorithm to estimate an unknown Gaussian distribution up to an arbitrarily tiny total variation error using $\widetilde{O}(d^2)$ samples while tolerating a constant fraction of adversarial outliers. Prior to our work, all efficient approximate DP algorithms incurred a super-quadratic sample cost or were not outlier-robust. For the special case of mean estimation, our algorithm achieves the optimal sample complexity of $\widetilde O(d)$, improving on a $\widetilde O(d^{1.5})$ bound from prior work. Our pure DP algorithm relies on a recursive private preconditioning subroutine that utilizes the recent work on private mean estimation [Hopkins et al., 2022]. Our approximate DP algorithms are based on a substantial upgrade of the method of stabilizing convex relaxations introduced in [Kothari et al., 2022].
We propose a general optimization-based framework for computing differentially private M-estimators and a new method for constructing differentially private confidence regions. First, we show that robust statistics can be used in conjunction with noisy gradient descent or noisy Newton methods to obtain optimal private estimators with global linear or quadratic convergence, respectively. We establish local and global convergence guarantees, under both local strong convexity and self-concordance, showing that our private estimators converge with high probability to a nearly optimal neighborhood of the non-private M-estimators. Second, we tackle the problem of parametric inference by constructing differentially private estimators of the asymptotic variance of our private M-estimators. This naturally leads to approximate pivotal statistics for constructing confidence regions and conducting hypothesis tests. We demonstrate the effectiveness of a bias correction that improves small-sample empirical performance in simulations. We illustrate the benefits of our methods in several numerical examples.
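A bare-bones noisy gradient descent loop of the kind referred to above, instantiated here for the Huber loss, might look like the following; the clipping bound and noise scale are placeholders rather than the paper's calibrated choices.

```python
import numpy as np

def huber_grad(r, delta=1.345):
    """Derivative of the Huber loss, a robust (bounded-influence) score function."""
    return np.clip(r, -delta, delta)

def noisy_gradient_descent(X, y, steps=200, lr=0.5, clip=1.0, sigma=0.05, rng=None):
    """Sketch: robust (Huber) regression fitted by gradient descent where each
    averaged gradient is clipped and perturbed by Gaussian noise before the step."""
    rng = np.random.default_rng(rng)
    n, d = X.shape
    theta = np.zeros(d)
    for _ in range(steps):
        grad = -(X.T @ huber_grad(y - X @ theta)) / n
        grad = grad / max(np.linalg.norm(grad) / clip, 1.0)      # clip the update direction
        theta -= lr * (grad + rng.normal(scale=sigma, size=d))   # noisy step
    return theta

rng = np.random.default_rng(2)
X = rng.normal(size=(3000, 4))
y = X @ np.array([1.0, -0.5, 0.25, 0.0]) + rng.standard_t(df=3, size=3000)
print(noisy_gradient_descent(X, y))
```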
The ''Propose-Test-Release'' (PTR) framework is a classic recipe for designing differentially private (DP) algorithms that are data-adaptive, i.e. those that add less noise when the input dataset is nice. We extend PTR to a more general setting by privately testing data-dependent privacy losses rather than local sensitivity, hence making it applicable beyond the standard noise-adding mechanisms, e.g. to queries with unbounded or undefined sensitivity. We demonstrate the versatility of generalized PTR using private linear regression as a case study. Additionally, we apply our algorithm to solve an open problem from ''Private Aggregation of Teacher Ensembles (PATE)'' -- privately releasing the entire model with a delicate data-dependent analysis.
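For readers unfamiliar with the classic recipe, a schematic PTR-style release of a median is sketched below; the proposed bound, the crude distance-to-instability proxy, and the Laplace-noised test are illustrative stand-ins, not the generalized test on data-dependent privacy losses developed in the paper.

```python
import numpy as np

def ptr_release_median(data, epsilon, delta, proposed_bound, rng=None):
    """Schematic Propose-Test-Release: propose a local-sensitivity bound, privately
    test (via a noisy distance check) whether the bound holds in a neighborhood of
    the input, and only then release the statistic with noise calibrated to the
    proposed bound. The distance computation below is a coarse stand-in."""
    rng = np.random.default_rng(rng)
    data = np.sort(np.asarray(data, dtype=float))
    n = len(data)
    m = n // 2
    # Stand-in "distance to instability": the largest k such that the gap between
    # the order statistics k positions above and below the median stays within the
    # proposed bound, a proxy for how many records must change before the local
    # sensitivity exceeds the proposal.
    distance = 0
    while (m + distance + 1 < n and m - distance - 1 >= 0 and
           data[m + distance + 1] - data[m - distance - 1] <= proposed_bound):
        distance += 1
    noisy_distance = distance + rng.laplace(scale=1.0 / epsilon)
    if noisy_distance <= np.log(1.0 / delta) / epsilon:
        return None                                   # test failed: refuse to release
    return np.median(data) + rng.laplace(scale=proposed_bound / epsilon)

rng = np.random.default_rng(3)
print(ptr_release_median(rng.normal(size=1001), epsilon=1.0, delta=1e-6,
                         proposed_bound=0.2))
```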
In this paper, we study the setting where agents (individuals) are strategic or self-interested and are concerned about their privacy when reporting their data. Compared with the classical setting, our goal is to design mechanisms that both incentivize most agents to truthfully report their data and preserve the privacy of individual reports, while the outputs should also be close to the underlying parameters. In the first part of the paper, we consider the case where the covariates are sub-Gaussian and the responses are heavy-tailed with only finite fourth moments. First, motivated by the stationarity condition of the maximizer of the likelihood function, we derive a novel private and closed-form estimator. Based on this estimator, we propose a mechanism which, through appropriate design of the computation and payment schemes for several canonical models such as linear regression, logistic regression, and Poisson regression, has the following properties: (1) the mechanism is $o(1)$-jointly differentially private (with probability at least $1-o(1)$); (2) it is an $o(\frac{1}{n})$-approximate Bayes Nash equilibrium for a $(1-o(1))$-fraction of agents to truthfully report their data, where $n$ is the number of agents; (3) the output can achieve an error of $o(1)$ with respect to the underlying parameter; (4) the mechanism is individually rational for a $(1-o(1))$-fraction of agents; (5) the payment budget required by the analyst to run the mechanism is $o(1)$. In the second part, we consider the linear regression model under a more general setting, in which both the covariates and responses are heavy-tailed with only finite fourth moments. By using an $\ell_4$-norm shrinkage operator, we propose a private estimator and a payment scheme with properties similar to those of the sub-Gaussian case.
Privacy-preserving data analysis has become prevalent in recent years. In this paper, we propose a distributed private majority-vote mechanism to solve the sign selection problem in a distributed setting. To this end, we apply iterative peeling to the stability function and use the exponential mechanism to recover the signs. As applications, we study private sign selection for mean estimation and linear regression problems in distributed systems. Our method recovers the support and signs with the optimal signal-to-noise ratio, matching the non-private scenario and improving over contemporary works on private variable selection. Moreover, the sign selection consistency is justified with theoretical guarantees. Simulation studies are conducted to demonstrate the effectiveness of the proposed method.
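The exponential-mechanism step for choosing a sign can be illustrated in a few lines; using the vote count as the utility score with sensitivity 1 is an assumption of this sketch, not the paper's exact stability function.

```python
import numpy as np

def private_sign(votes, epsilon, rng=None):
    """Sketch of sign selection via the exponential mechanism: each party casts a
    vote in {-1, +1}; the utility of each candidate sign is its vote count, whose
    sensitivity is 1 when one party changes its vote."""
    rng = np.random.default_rng(rng)
    votes = np.asarray(votes)
    utilities = np.array([np.sum(votes == -1), np.sum(votes == +1)])
    logits = epsilon * utilities / 2.0
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return int(rng.choice([-1, +1], p=probs))

print(private_sign([+1, +1, +1, -1, +1, +1], epsilon=1.0))
```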
We consider training models on private data that are distributed across user devices. To ensure privacy, we add on-device noise and use secure aggregation so that only the noisy sum is revealed to the server. We present a comprehensive end-to-end system that appropriately discretizes the data and adds discrete Gaussian noise before performing secure aggregation. We provide a novel privacy analysis for sums of discrete Gaussians and carefully analyze the effects of data quantization and modular summation arithmetic. Our theoretical guarantees highlight the complex tension between communication, privacy, and accuracy. Our extensive experimental results demonstrate that our solution is essentially able to match the accuracy of central differential privacy with less than 16 bits of precision per value.
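A stripped-down version of the per-device pipeline (scale, stochastically round, add integer noise, aggregate modulo a field size) is sketched below; all constants and the use of two-sided geometric noise in place of the paper's discrete Gaussian sampler are simplifying assumptions.

```python
import numpy as np

MODULUS = 2 ** 16          # finite field size for secure aggregation (illustrative)
SCALE = 2 ** 8             # fixed-point scaling factor (illustrative)

def two_sided_geometric(scale, size, rng):
    """Stand-in integer noise (difference of two geometric variables); the actual
    system uses discrete Gaussian noise."""
    p = 1.0 - np.exp(-1.0 / scale)
    return rng.geometric(p, size=size) - rng.geometric(p, size=size)

def client_encode(x, rng):
    """Sketch of the on-device step: scale, stochastically round to integers,
    add integer noise, and reduce modulo the field size."""
    scaled = x * SCALE
    floor = np.floor(scaled)
    rounded = floor + (rng.random(x.shape) < (scaled - floor))   # randomized rounding
    noisy = rounded + two_sided_geometric(scale=20.0, size=x.shape, rng=rng)
    return noisy.astype(np.int64) % MODULUS

def server_decode(summed, n_clients):
    """Undo the modular wrap-around (assuming the true sum stays in range) and rescale."""
    centered = np.where(summed >= MODULUS // 2, summed - MODULUS, summed)
    return centered / (SCALE * n_clients)

rng = np.random.default_rng(4)
clients = [rng.uniform(-0.5, 0.5, size=8) for _ in range(50)]
total = np.zeros(8, dtype=np.int64)
for x in clients:
    total = (total + client_encode(x, rng)) % MODULUS            # secure-aggregation stand-in
print(server_decode(total, n_clients=len(clients)))
print(np.mean(clients, axis=0))                                   # compare with the true mean
```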
We study the task of training regression models with the guarantee of label differential privacy (DP). Based on a global prior distribution on label values, which could be obtained privately, we derive a label DP randomization mechanism that is optimal under a given regression loss function. We prove that the optimal mechanism takes the form of a ``randomized response on bins'', and propose an efficient algorithm for finding the optimal bin values. We carry out a thorough experimental evaluation on several datasets demonstrating the efficacy of our algorithm.
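To make the ``randomized response on bins'' idea concrete, here is a minimal label-DP randomizer over a fixed set of bin values; the uniform bin placement and the standard k-ary randomized response probabilities are illustrative, whereas the paper optimizes the bin values against the regression loss and the prior.

```python
import numpy as np

def randomized_response_on_bins(labels, bins, epsilon, rng=None):
    """Sketch: snap each label to its nearest bin, then apply k-ary randomized
    response over the bins: keep the true bin with probability
    exp(eps) / (exp(eps) + k - 1), otherwise report a uniformly random other bin."""
    rng = np.random.default_rng(rng)
    bins = np.asarray(bins, dtype=float)
    k = len(bins)
    true_idx = np.abs(labels[:, None] - bins[None, :]).argmin(axis=1)
    p_keep = np.exp(epsilon) / (np.exp(epsilon) + k - 1)
    keep = rng.random(len(labels)) < p_keep
    noisy_idx = np.where(keep, true_idx,
                         (true_idx + rng.integers(1, k, size=len(labels))) % k)
    return bins[noisy_idx]

rng = np.random.default_rng(5)
y = rng.uniform(0.0, 10.0, size=10)
print(randomized_response_on_bins(y, bins=np.linspace(0.5, 9.5, 10), epsilon=2.0))
```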
In many applications, multiple parties hold private data regarding the same users but on disjoint sets of attributes, and a server wants to leverage the data to train a model. To enable model learning while protecting the privacy of the data subjects, we need vertical federated learning (VFL) techniques, where the data parties share only information needed for training the model, instead of the private data itself. However, it is challenging to ensure that the shared information maintains privacy while still enabling accurate models. To the best of our knowledge, the algorithm proposed in this paper is the first practical solution for differentially private vertical federated k-means clustering, in which the server can obtain global centers with a provable differential privacy guarantee. Our algorithm assumes an untrusted central server that aggregates differentially private local centers and membership encodings from the local data parties. It builds a weighted grid as a synopsis of the global dataset based on the received information. Final centers are generated by running any k-means algorithm on the weighted grid. Our grid weight estimation method uses a novel, lightweight, and differentially private set-intersection cardinality estimation algorithm based on the Flajolet-Martin sketch. To improve the estimation accuracy in settings with more than two data parties, we further propose a refined version of the weight estimation algorithm and a parameter tuning strategy that bring the final k-means utility close to that in the central private setting. We provide theoretical utility analysis and experimental evaluation results for the cluster centers computed by our algorithm, and show that our approach performs better both theoretically and empirically than two baselines based on existing techniques.
Privacy-preserving machine learning algorithms are crucial for the increasingly common setting in which personal data, such as medical or financial records, are analyzed. We provide general techniques to produce privacy-preserving approximations of classifiers learned via (regularized) empirical risk minimization (ERM). These algorithms are private under the $\epsilon$-differential privacy definition due to Dwork et al. (2006). First we apply the output perturbation ideas of Dwork et al. (2006), to ERM classification. Then we propose a new method, objective perturbation, for privacy-preserving machine learning algorithm design. This method entails perturbing the objective function before optimizing over classifiers. If the loss and regularizer satisfy certain convexity and differentiability criteria, we prove theoretical results showing that our algorithms preserve privacy, and provide generalization bounds for linear and nonlinear kernels. We further present a privacy-preserving technique for tuning the parameters in general machine learning algorithms, thereby providing end-to-end privacy guarantees for the training process. We apply these results to produce privacy-preserving analogues of regularized logistic regression and support vector machines. We obtain encouraging results from evaluating their performance on real demographic and benchmark data sets. Our results show that both theoretically and empirically, objective perturbation is superior to the previous state-of-the-art, output perturbation, in managing the inherent tradeoff between privacy and learning performance.
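Objective perturbation adds a random linear term to the training objective before optimizing. A rough sketch for L2-regularized logistic regression (using scipy's optimizer) is shown below; the noise distribution is the commonly cited one, but the omission of the extra regularization correction term and the simple constants make this a sketch rather than the full construction.

```python
import numpy as np
from scipy.optimize import minimize

def objective_perturbation_logreg(X, y, epsilon, lam=1.0, rng=None):
    """Sketch: draw a random vector b and minimize the regularized logistic loss
    plus the linear term b^T w / n; rows of X are assumed to have L2 norm <= 1."""
    rng = np.random.default_rng(rng)
    n, d = X.shape
    # Direction uniform on the sphere, norm ~ Gamma(d, 2/epsilon): density ~ exp(-eps*||b||/2).
    direction = rng.normal(size=d)
    direction /= np.linalg.norm(direction)
    b = direction * rng.gamma(shape=d, scale=2.0 / epsilon)

    def perturbed_objective(w):
        margins = y * (X @ w)
        loss = np.mean(np.log1p(np.exp(-margins)))
        return loss + 0.5 * lam * (w @ w) + (b @ w) / n

    return minimize(perturbed_objective, np.zeros(d), method="L-BFGS-B").x

rng = np.random.default_rng(6)
X = rng.normal(size=(2000, 5))
X /= np.maximum(np.linalg.norm(X, axis=1, keepdims=True), 1.0)   # enforce row-norm bound
y = np.sign(X @ np.array([1.0, -1.0, 0.5, 0.0, 0.25]) + 0.1 * rng.normal(size=2000))
print(objective_perturbation_logreg(X, y, epsilon=1.0))
```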
The offline reinforcement learning (RL) problem is often motivated by the need to learn data-driven decision policies in financial, legal and healthcare applications. However, the learned policy could retain sensitive information of individuals in the training data (e.g., treatment and outcome of patients), thus susceptible to various privacy risks. We design offline RL algorithms with differential privacy guarantees which provably prevent such risks. These algorithms also enjoy strong instance-dependent learning bounds under both tabular and linear Markov decision process (MDP) settings. Our theory and simulation suggest that the privacy guarantee comes at (almost) no drop in utility comparing to the non-private counterpart for a medium-size dataset.
Deep neural networks have strong capabilities of memorizing the underlying training data, which can be a serious privacy concern. An effective solution to this problem is to train models with differential privacy, which provides rigorous privacy guarantees by injecting random noise to the gradients. This paper focuses on the scenario where sensitive data are distributed among multiple participants, who jointly train a model through federated learning (FL), using both secure multiparty computation (MPC) to ensure the confidentiality of each gradient update, and differential privacy to avoid data leakage in the resulting model. A major challenge in this setting is that common mechanisms for enforcing DP in deep learning, which inject real-valued noise, are fundamentally incompatible with MPC, which exchanges finite-field integers among the participants. Consequently, most existing DP mechanisms require rather high noise levels, leading to poor model utility. Motivated by this, we propose Skellam mixture mechanism (SMM), an approach to enforce DP on models built via FL. Compared to existing methods, SMM eliminates the assumption that the input gradients must be integer-valued, and, thus, reduces the amount of noise injected to preserve DP. Further, SMM allows tight privacy accounting due to the nice composition and sub-sampling properties of the Skellam distribution, which are key to accurate deep learning with DP. The theoretical analysis of SMM is highly non-trivial, especially considering (i) the complicated math of differentially private deep learning in general and (ii) the fact that the mixture of two Skellam distributions is rather complex, and to our knowledge, has not been studied in the DP literature. Extensive experiments on various practical settings demonstrate that SMM consistently and significantly outperforms existing solutions in terms of the utility of the resulting model.
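Skellam noise (the difference of two Poisson draws) is integer-valued, which is what makes it compatible with finite-field secure aggregation; a tiny sketch of adding such noise to an integerized gradient follows, with the scaling factor and variance left as uncalibrated placeholders.

```python
import numpy as np

def add_skellam_noise(int_gradient, variance, rng=None):
    """Sketch: Skellam(mu, mu) noise is the difference of two independent
    Poisson(mu) draws; it is integer-valued with mean 0 and variance 2*mu,
    so mu = variance / 2 targets the requested variance."""
    rng = np.random.default_rng(rng)
    mu = variance / 2.0
    noise = rng.poisson(mu, size=int_gradient.shape) - rng.poisson(mu, size=int_gradient.shape)
    return int_gradient + noise

rng = np.random.default_rng(7)
gradient = rng.normal(size=1000)
scaled = np.round(gradient * 2 ** 10).astype(np.int64)   # integerize for MPC (illustrative scaling)
noisy = add_skellam_noise(scaled, variance=(0.5 * 2 ** 10) ** 2, rng=rng)
print(np.std((noisy - scaled) / 2 ** 10))                 # empirical noise std, roughly 0.5
```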
Given a symmetric matrix $M$ and a vector $\lambda$, we present new bounds on the Frobenius-distance utility of the Gaussian mechanism for approximating $M$ by a matrix whose spectrum is $\lambda$, under $(\varepsilon,\delta)$-differential privacy. Our bounds depend on both $\lambda$ and the gaps in the eigenvalues of $M$, and hold whenever the top $k+1$ eigenvalues of $M$ have sufficiently large gaps. When applied to the problems of private rank-$k$ covariance matrix approximation and subspace recovery, our bounds yield improvements over previous bounds. Our bounds are obtained by viewing the addition of Gaussian noise as a continuous-time matrix Brownian motion. This viewpoint allows us to track the evolution of eigenvalues and eigenvectors of the matrix, which are governed by stochastic differential equations discovered by Dyson. These equations allow us to bound the utility as the square-root of a sum-of-squares of perturbations to the eigenvectors, as opposed to a sum of perturbation bounds obtained via Davis-Kahan-type theorems.
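The mechanism under analysis can be stated in a few lines: add a symmetric Gaussian noise matrix to $M$, then output the matrix with the noisy eigenvectors and the prescribed spectrum $\lambda$. The sketch below follows that recipe with a placeholder noise scale; the paper's contribution is the utility analysis, not the mechanism itself.

```python
import numpy as np

def project_to_spectrum(M, lam, sigma, rng=None):
    """Sketch: perturb the symmetric matrix M with a symmetric Gaussian noise
    matrix (scale sigma is a placeholder for a properly calibrated value), then
    output the matrix with the noisy eigenvectors and the prescribed spectrum lam."""
    rng = np.random.default_rng(rng)
    d = M.shape[0]
    G = rng.normal(scale=sigma, size=(d, d))
    noisy = M + (G + G.T) / np.sqrt(2.0)
    eigvals, eigvecs = np.linalg.eigh(noisy)          # eigenvalues in ascending order
    lam_sorted = np.sort(np.asarray(lam, dtype=float))
    return eigvecs @ np.diag(lam_sorted) @ eigvecs.T

rng = np.random.default_rng(8)
A = rng.normal(size=(6, 6))
M = A @ A.T
lam = np.sort(np.linalg.eigvalsh(M))                 # here: the true spectrum as the target
print(np.linalg.norm(project_to_spectrum(M, lam, sigma=0.1) - M, ord="fro"))
```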
In this paper, we study differentially private empirical risk minimization (DP-ERM). It has been shown that the (worst-case) utility of DP-ERM decreases as the dimension increases. This is a major obstacle to privately learning large machine learning models. In high dimensions, it is common for some of the model's parameters to carry more information than others. To exploit this, we propose a differentially private greedy coordinate descent (DP-GCD) algorithm. At each iteration, DP-GCD privately performs a coordinate-wise gradient step along the (approximately) largest entry of the gradient. We show theoretically that DP-GCD can improve utility by exploiting structural properties of the problem's solution (such as sparsity or quasi-sparsity), making very fast progress in early iterations. We then illustrate this numerically, both on synthetic and real datasets. Finally, we describe promising directions for future work.
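A minimal rendition of the greedy step (select the coordinate with the approximately largest gradient entry via a noisy-max report, then take a noisy step along it) is sketched below for least squares; the noise scales and step size are placeholders rather than the paper's calibrated values.

```python
import numpy as np

def dp_greedy_coordinate_descent(X, y, iters=100, lr=0.5, sigma_select=0.01,
                                 sigma_step=0.01, rng=None):
    """Sketch of DP-GCD for least squares: at each iteration, select the coordinate
    with the (approximately) largest absolute gradient entry via report-noisy-max,
    then update only that coordinate with an additionally noised gradient step."""
    rng = np.random.default_rng(rng)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(iters):
        grad = X.T @ (X @ w - y) / n
        scores = np.abs(grad) + rng.laplace(scale=sigma_select, size=d)   # noisy-max selection
        j = int(np.argmax(scores))
        w[j] -= lr * (grad[j] + rng.normal(scale=sigma_step))             # noisy coordinate step
    return w

rng = np.random.default_rng(9)
true_w = np.zeros(50)
true_w[[3, 17, 42]] = [2.0, -1.5, 1.0]                                    # sparse ground truth
X = rng.normal(size=(5000, 50))
y = X @ true_w + 0.05 * rng.normal(size=5000)
w_hat = dp_greedy_coordinate_descent(X, y)
print(np.argsort(-np.abs(w_hat))[:3])                                     # largest fitted coordinates
```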
Bayesian learning via stochastic gradient Langevin dynamics (SGLD) has been suggested for differentially private learning. While previous research provides differential privacy bounds for SGLD at the initial steps of the algorithm or when close to convergence, the question of what differential privacy guarantees can be made in between remains open. This interim region is of great importance, especially for Bayesian neural networks, as it is hard to guarantee convergence to the posterior. This paper shows that using SGLD might result in unbounded privacy loss for this interim region, even when sampling from the posterior.
We introduce a differentially private method to measure nonlinear correlations between sensitive data hosted across two entities. We provide utility guarantees for our private estimator. To the best of our knowledge, ours is the first private estimator of nonlinear correlations in a multi-party setup. The important measure of nonlinear correlation we consider is distance correlation. This work has direct applications to private feature screening, private independence testing, private k-sample tests, private multi-party causal inference, and private data synthesis, in addition to exploratory data analysis. Code access: a link to publicly accessible code is provided in the supplementary material.
Rankings are widely collected in various real-life scenarios, leading to the leakage of personal information such as users' preferences on videos or news. To protect rankings, existing works mainly develop privacy protection on a single ranking within a set of rankings or on pairwise comparisons of a ranking under $\epsilon$-differential privacy. This paper proposes a novel notion called $\epsilon$-ranking differential privacy for protecting ranks. We establish the connection between the Mallows model (Mallows, 1957) and the proposed $\epsilon$-ranking differential privacy. This allows us to develop a multistage ranking algorithm to generate synthetic rankings while satisfying the developed $\epsilon$-ranking differential privacy. Theoretical results regarding the utility of synthetic rankings in downstream tasks, including the inference attack and the personalized ranking tasks, are established. For the inference attack, we quantify how $\epsilon$ affects the estimation of the true ranking based on synthetic rankings. For the personalized ranking task, we consider varying privacy preferences among users and quantify how their privacy preferences affect the consistency in estimating the optimal ranking function. Extensive numerical experiments are carried out to verify the theoretical results and demonstrate the effectiveness of the proposed synthetic ranking algorithm.