我们考虑在数据源相似但非相同的高维环境中荟萃分析的任务。为了在这种异质数据集中借用强度,我们引入了一个全球参数,该参数强调存在异质性的解释性和统计效率。我们还提出了一个全局参数的单发估计器,该估计值保留了数据源的匿名性,并以取决于组合数据集大小的速率收敛。对于高维线性模型设置,我们在适应以前看到的数据分布以及预测新/看不见的数据分布方面证明了识别限制的优越性。最后,我们证明了方法在涉及多个癌细胞线的大规模药物治疗数据集中的好处。
translated by 谷歌翻译
我们提出了一种分布式引导方法,用于同时推断高维大量数据,该数据被许多机器存储和处理。该方法基于通信有效的偏差套索产生$ \ ell_ \ infty $ norm置信区域,我们提出了一种有效的交叉验证方法来调整每种迭代的方法。从理论上讲,我们证明了对通信的数量$ \ tau _ {\ min} $的下限,它值得统计准确性和效率。此外,$ \ tau _ {\ min} $仅与工人数量和固有维度的对数增加,而几乎不变为标称维度。我们通过广泛的仿真研究测试我们的理论,以及基于美国航空公司的按时绩效数据集的半合成数据集上的可变筛选任务。复制数值结果的代码可在GitHub上获得:https://github.com/skchao74/distributed-bootstrap。
translated by 谷歌翻译
我们解决了如何在没有严格缩放条件的情况下实现分布式分数回归中最佳推断的问题。由于分位数回归(QR)损失函数的非平滑性质,这是具有挑战性的,这使现有方法的使用无效。难度通过应用于本地(每个数据源)和全局目标函数的双光滑方法解决。尽管依赖局部和全球平滑参数的精致组合,但分位数回归模型是完全参数的,从而促进了解释。在低维度中,我们为顺序定义的分布式QR估计器建立了有限样本的理论框架。这揭示了通信成本和统计错误之间的权衡。我们进一步讨论并比较了基于WALD和得分型测试和重采样技术的反转的几种替代置信集结构,并详细介绍了对更极端分数系数有效的改进。在高维度中,采用了一个稀疏的框架,其中提出的双滑目标功能与$ \ ell_1 $ -penalty相辅相成。我们表明,相应的分布式QR估计器在近乎恒定的通信回合之后达到了全球收敛率。一项彻底的模拟研究进一步阐明了我们的发现。
translated by 谷歌翻译
High-dimensional data can often display heterogeneity due to heteroscedastic variance or inhomogeneous covariate effects. Penalized quantile and expectile regression methods offer useful tools to detect heteroscedasticity in high-dimensional data. The former is computationally challenging due to the non-smooth nature of the check loss, and the latter is sensitive to heavy-tailed error distributions. In this paper, we propose and study (penalized) robust expectile regression (retire), with a focus on iteratively reweighted $\ell_1$-penalization which reduces the estimation bias from $\ell_1$-penalization and leads to oracle properties. Theoretically, we establish the statistical properties of the retire estimator under two regimes: (i) low-dimensional regime in which $d \ll n$; (ii) high-dimensional regime in which $s\ll n\ll d$ with $s$ denoting the number of significant predictors. In the high-dimensional setting, we carefully characterize the solution path of the iteratively reweighted $\ell_1$-penalized retire estimation, adapted from the local linear approximation algorithm for folded-concave regularization. Under a mild minimum signal strength condition, we show that after as many as $\log(\log d)$ iterations the final iterate enjoys the oracle convergence rate. At each iteration, the weighted $\ell_1$-penalized convex program can be efficiently solved by a semismooth Newton coordinate descent algorithm. Numerical studies demonstrate the competitive performance of the proposed procedure compared with either non-robust or quantile regression based alternatives.
translated by 谷歌翻译
由于在数据稀缺的设置中,交叉验证的性能不佳,我们提出了一个新颖的估计器,以估计数据驱动的优化策略的样本外部性能。我们的方法利用优化问题的灵敏度分析来估计梯度关于数据中噪声量的最佳客观值,并利用估计的梯度将策略的样本中的表现为依据。与交叉验证技术不同,我们的方法避免了为测试集牺牲数据,在训练和因此非常适合数据稀缺的设置时使用所有数据。我们证明了我们估计量的偏见和方差范围,这些问题与不确定的线性目标优化问题,但已知的,可能是非凸的,可行的区域。对于更专业的优化问题,从某种意义上说,可行区域“弱耦合”,我们证明结果更强。具体而言,我们在估算器的错误上提供明确的高概率界限,该估计器在策略类别上均匀地保持,并取决于问题的维度和策略类的复杂性。我们的边界表明,在轻度条件下,随着优化问题的尺寸的增长,我们的估计器的误差也会消失,即使可用数据的量仍然很小且恒定。说不同的是,我们证明我们的估计量在小型数据中的大规模政权中表现良好。最后,我们通过数值将我们提出的方法与最先进的方法进行比较,通过使用真实数据调度紧急医疗响应服务的案例研究。我们的方法提供了更准确的样本外部性能估计,并学习了表现更好的政策。
translated by 谷歌翻译
In a high dimensional linear predictive regression where the number of potential predictors can be larger than the sample size, we consider using LASSO, a popular L1-penalized regression method, to estimate the sparse coefficients when many unit root regressors are present. Consistency of LASSO relies on two building blocks: the deviation bound of the cross product of the regressors and the error term, and the restricted eigenvalue of the Gram matrix of the regressors. In our setting where unit root regressors are driven by temporal dependent non-Gaussian innovations, we establish original probabilistic bounds for these two building blocks. The bounds imply that the rates of convergence of LASSO are different from those in the familiar cross sectional case. In practical applications given a mixture of stationary and nonstationary predictors, asymptotic guarantee of LASSO is preserved if all predictors are scale-standardized. In an empirical example of forecasting the unemployment rate with many macroeconomic time series, strong performance is delivered by LASSO when the initial specification is guided by macroeconomic domain expertise.
translated by 谷歌翻译
我们提出了一种估计具有标称分类数据的高维线性模型的方法。我们的估算器,称为范围,通过使其相应的系数完全相等来融合水平。这是通过对分类变量的系数的阶数统计之间的差异之间的差异来实现这一点,从而聚类系数。我们提供了一种算法,用于精确和有效地计算在具有潜在许多级别的单个变量的情况下的总体上的最小值的全局最小值,并且在多变量情况下在块坐标血管下降过程中使用它。我们表明,利用未知级别融合的Oracle最小二乘解决方案是具有高概率的坐标血缘的极限点,只要真正的级别具有一定的最小分离;已知这些条件在单变量案例中最小。我们展示了在一系列实际和模拟数据集中的范围的有利性能。 R包的R包Catreg实现线性模型的范围,也可以在CRAN上提供逻辑回归的版本。
translated by 谷歌翻译
This paper provides estimation and inference methods for a conditional average treatment effects (CATE) characterized by a high-dimensional parameter in both homogeneous cross-sectional and unit-heterogeneous dynamic panel data settings. In our leading example, we model CATE by interacting the base treatment variable with explanatory variables. The first step of our procedure is orthogonalization, where we partial out the controls and unit effects from the outcome and the base treatment and take the cross-fitted residuals. This step uses a novel generic cross-fitting method we design for weakly dependent time series and panel data. This method "leaves out the neighbors" when fitting nuisance components, and we theoretically power it by using Strassen's coupling. As a result, we can rely on any modern machine learning method in the first step, provided it learns the residuals well enough. Second, we construct an orthogonal (or residual) learner of CATE -- the Lasso CATE -- that regresses the outcome residual on the vector of interactions of the residualized treatment with explanatory variables. If the complexity of CATE function is simpler than that of the first-stage regression, the orthogonal learner converges faster than the single-stage regression-based learner. Third, we perform simultaneous inference on parameters of the CATE function using debiasing. We also can use ordinary least squares in the last two steps when CATE is low-dimensional. In heterogeneous panel data settings, we model the unobserved unit heterogeneity as a weakly sparse deviation from Mundlak (1978)'s model of correlated unit effects as a linear function of time-invariant covariates and make use of L1-penalization to estimate these models. We demonstrate our methods by estimating price elasticities of groceries based on scanner data. We note that our results are new even for the cross-sectional (i.i.d) case.
translated by 谷歌翻译
We extend best-subset selection to linear Multi-Task Learning (MTL), where a set of linear models are jointly trained on a collection of datasets (``tasks''). Allowing the regression coefficients of tasks to have different sparsity patterns (i.e., different supports), we propose a modeling framework for MTL that encourages models to share information across tasks, for a given covariate, through separately 1) shrinking the coefficient supports together, and/or 2) shrinking the coefficient values together. This allows models to borrow strength during variable selection even when the coefficient values differ markedly between tasks. We express our modeling framework as a Mixed-Integer Program, and propose efficient and scalable algorithms based on block coordinate descent and combinatorial local search. We show our estimator achieves statistically optimal prediction rates. Importantly, our theory characterizes how our estimator leverages the shared support information across tasks to achieve better variable selection performance. We evaluate the performance of our method in simulations and two biology applications. Our proposed approaches outperform other sparse MTL methods in variable selection and prediction accuracy. Interestingly, penalties that shrink the supports together often outperform penalties that shrink the coefficient values together. We will release an R package implementing our methods.
translated by 谷歌翻译
在本文中,我们利用过度参数化来设计高维单索索引模型的无规矩算法,并为诱导的隐式正则化现象提供理论保证。具体而言,我们研究了链路功能是非线性且未知的矢量和矩阵单索引模型,信号参数是稀疏向量或低秩对称矩阵,并且响应变量可以是重尾的。为了更好地理解隐含正规化的角色而没有过度的技术性,我们假设协变量的分布是先验的。对于载体和矩阵设置,我们通过采用分数函数变换和专为重尾数据的强大截断步骤来构造过度参数化最小二乘损耗功能。我们建议通过将无规则化的梯度下降应用于损耗函数来估计真实参数。当初始化接近原点并且步骤中足够小时,我们证明了所获得的解决方案在载体和矩阵案件中实现了最小的收敛统计速率。此外,我们的实验结果支持我们的理论调查结果,并表明我们的方法在$ \ ell_2 $ -staticatisticated率和变量选择一致性方面具有明确的正则化的经验卓越。
translated by 谷歌翻译
异常值广泛发生在大数据应用中,可能严重影响统计估计和推理。在本文中,引入了抗强估计的框架,以强制任意给出的损耗函数。它与修剪方法密切连接,并且包括所有样本的显式外围参数,这反过来促进计算,理论和参数调整。为了解决非凸起和非体性的问题,我们开发可扩展的算法,以实现轻松和保证快速收敛。特别地,提出了一种新的技术来缓解对起始点的要求,使得在常规数据集上,可以大大减少数据重采样的数量。基于组合的统计和计算处理,我们能够超越M估计来执行非因思分析。所获得的抗性估算器虽然不一定全局甚至是局部最佳的,但在低维度和高维度中享有最小的速率最优性。回归,分类和神经网络的实验表明,在总异常值发生的情况下提出了拟议方法的优异性能。
translated by 谷歌翻译
套索是一种高维回归的方法,当时,当协变量$ p $的订单数量或大于观测值$ n $时,通常使用它。由于两个基本原因,经典的渐近态性理论不适用于该模型:$(1)$正规风险是非平滑的; $(2)$估算器$ \ wideHat {\ boldsymbol {\ theta}} $与true参数vector $ \ boldsymbol {\ theta}^*$无法忽略。结果,标准的扰动论点是渐近正态性的传统基础。另一方面,套索估计器可以精确地以$ n $和$ p $大,$ n/p $的订单为一。这种表征首先是在使用I.I.D的高斯设计的情况下获得的。协变量:在这里,我们将其推广到具有非偏差协方差结构的高斯相关设计。这是根据更简单的``固定设计''模型表示的。我们在两个模型中各种数量的分布之间的距离上建立了非反应界限,它们在合适的稀疏类别中均匀地固定在信号上$ \ boldsymbol {\ theta}^*$。作为应用程序,我们研究了借助拉索的分布,并表明需要校正程度对于计算有效的置信区间是必要的。
translated by 谷歌翻译
本文提出了在多阶段实验的背景下的异质治疗效应的置信区间结构,以$ N $样品和高维,$ D $,混淆。我们的重点是$ d \ gg n $的情况,但获得的结果也适用于低维病例。我们展示了正则化估计的偏差,在高维变焦空间中不可避免,具有简单的双重稳固分数。通过这种方式,不需要额外的偏差,并且我们获得root $ N $推理结果,同时允许治疗和协变量的多级相互依赖性。记忆财产也没有假设;治疗可能取决于所有先前的治疗作业以及以前的所有多阶段混淆。我们的结果依赖于潜在依赖的某些稀疏假设。我们发现具有动态处理的强大推理所需的新产品率条件。
translated by 谷歌翻译
Integrative analysis of data from multiple sources is critical to making generalizable discoveries. Associations that are consistently observed across multiple source populations are more likely to be generalized to target populations with possible distributional shifts. In this paper, we model the heterogeneous multi-source data with multiple high-dimensional regressions and make inferences for the maximin effect (Meinshausen, B{\"u}hlmann, AoS, 43(4), 1801--1830). The maximin effect provides a measure of stable associations across multi-source data. A significant maximin effect indicates that a variable has commonly shared effects across multiple source populations, and these shared effects may be generalized to a broader set of target populations. There are challenges associated with inferring maximin effects because its point estimator can have a non-standard limiting distribution. We devise a novel sampling method to construct valid confidence intervals for maximin effects. The proposed confidence interval attains a parametric length. This sampling procedure and the related theoretical analysis are of independent interest for solving other non-standard inference problems. Using genetic data on yeast growth in multiple environments, we demonstrate that the genetic variants with significant maximin effects have generalizable effects under new environments.
translated by 谷歌翻译
稳定性选择(Meinshausen和Buhlmann,2010)通过返回许多副页面一致选择的功能来使任何特征选择方法更稳定。我们证明(在我们的知识中,它的知识,它的第一个结果),对于包含重要潜在变量的高度相关代理的数据,套索通常选择一个代理,但与套索的稳定性选择不能选择任何代理,导致比单独的套索更糟糕的预测性能。我们介绍集群稳定性选择,这利用了从业者的知识,即数据中存在高度相关的集群,从而产生比此设置中的稳定性选择更好的特征排名。我们考虑了几种特征组合方法,包括在每个重要集群中占据各个重要集群中的特征的加权平均值,其中重量由选择集群成员的频率决定,我们显示的是比以前的提案更好地导致更好的预测模型。我们呈现来自Meinshausen和Buhlmann(2010)和Shah和Samworth(2012)的理论担保的概括,以表明集群稳定选择保留相同的保证。总之,集群稳定性选择享有两个世界的最佳选择,产生既稳定的稀疏选择集,具有良好的预测性能。
translated by 谷歌翻译
我们研究稀疏的线性回归在一个代理网络上,建模为无向图(没有集中式节点)。估计问题被制定为当地套索损失函数的最小化,加上共识约束的二次惩罚 - 后者是获取分布式解决方案方法的工具。虽然在优化文献中广泛研究了基于惩罚的共识方法,但其高维设置中的统计和计算保证仍不清楚。这项工作提供了对此公开问题的答案。我们的贡献是两倍。 First, we establish statistical consistency of the estimator: under a suitable choice of the penalty parameter, the optimal solution of the penalized problem achieves near optimal minimax rate $\mathcal{O}(s \log d/N)$ in $\ell_2 $ -loss,$ s $是稀疏性值,$ d $是环境维度,$ n $是网络中的总示例大小 - 这与集中式采样率相匹配。其次,我们表明,应用于惩罚问题的近端梯度算法,它自然导致分布式实现,线性地收敛到集中统计误差的顺序的公差 - 速率比例为$ \ mathcal {o}( d)$,揭示不可避免的速度准确性困境。数值结果证明了衍生的采样率和收敛速率缩放的紧张性。
translated by 谷歌翻译
本文为信号去噪提供了一般交叉验证框架。然后将一般框架应用于非参数回归方法,例如趋势过滤和二元推车。然后显示所得到的交叉验证版本以获得最佳调谐的类似物所熟知的几乎相同的收敛速度。没有任何先前的趋势过滤或二元推车的理论分析。为了说明框架的一般性,我们还提出并研究了两个基本估算器的交叉验证版本;套索用于高维线性回归和矩阵估计的奇异值阈值阈值。我们的一般框架是由Chatterjee和Jafarov(2015)的想法的启发,并且可能适用于使用调整参数的广泛估算方法。
translated by 谷歌翻译
我们提出了一种基于优化的基于优化的框架,用于计算差异私有M估算器以及构建差分私立置信区的新方法。首先,我们表明稳健的统计数据可以与嘈杂的梯度下降或嘈杂的牛顿方法结合使用,以便分别获得具有全局线性或二次收敛的最佳私人估算。我们在局部强大的凸起和自我协调下建立当地和全球融合保障,表明我们的私人估算变为对非私人M估计的几乎最佳附近的高概率。其次,我们通过构建我们私有M估计的渐近方差的差异私有估算来解决参数化推断的问题。这自然导致近​​似枢轴统计,用于构建置信区并进行假设检测。我们展示了偏置校正的有效性,以提高模拟中的小样本实证性能。我们说明了我们在若干数值例子中的方法的好处。
translated by 谷歌翻译
现代高维方法经常采用“休稀稀物”的原则,而在监督多元学习统计学中可能面临着大量非零系数的“密集”问题。本文提出了一种新的聚类减少秩(CRL)框架,其施加了两个联合矩阵规范化,以自动分组构建预测因素的特征。 CRL比低级别建模更具可解释,并放松变量选择中的严格稀疏假设。在本文中,提出了新的信息 - 理论限制,揭示了寻求集群的内在成本,以及多元学习中的维度的祝福。此外,开发了一种有效的优化算法,其执行子空间学习和具有保证融合的聚类。所获得的定点估计器虽然不一定是全局最佳的,但在某些规则条件下享有超出标准似然设置的所需的统计准确性。此外,提出了一种新的信息标准,以及其无垢形式,用于集群和秩选择,并且具有严格的理论支持,而不假设无限的样本大小。广泛的模拟和实数据实验证明了所提出的方法的统计准确性和可解释性。
translated by 谷歌翻译
Testing the significance of a variable or group of variables $X$ for predicting a response $Y$, given additional covariates $Z$, is a ubiquitous task in statistics. A simple but common approach is to specify a linear model, and then test whether the regression coefficient for $X$ is non-zero. However, when the model is misspecified, the test may have poor power, for example when $X$ is involved in complex interactions, or lead to many false rejections. In this work we study the problem of testing the model-free null of conditional mean independence, i.e. that the conditional mean of $Y$ given $X$ and $Z$ does not depend on $X$. We propose a simple and general framework that can leverage flexible nonparametric or machine learning methods, such as additive models or random forests, to yield both robust error control and high power. The procedure involves using these methods to perform regressions, first to estimate a form of projection of $Y$ on $X$ and $Z$ using one half of the data, and then to estimate the expected conditional covariance between this projection and $Y$ on the remaining half of the data. While the approach is general, we show that a version of our procedure using spline regression achieves what we show is the minimax optimal rate in this nonparametric testing problem. Numerical experiments demonstrate the effectiveness of our approach both in terms of maintaining Type I error control, and power, compared to several existing approaches.
translated by 谷歌翻译