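Thanks to its outstanding empirical performance, the random forest has been one of the most widely used machine learning methods of the past decade. Yet, because of its black-box nature, random forest results can be hard to interpret in many big data applications. Quantifying the usefulness of individual features in random forest learning can greatly enhance interpretability. Existing studies have shown that some popular feature importance measures for random forests suffer from bias. In addition, comprehensive size and power analyses are lacking for most existing methods. In this paper, we approach the problem via hypothesis testing and propose a framework, the self-normalized feature-residual correlation test (FACT), for evaluating the significance of a given feature in a random forest model with a bias-resistance property, where the null hypothesis is that the feature is conditionally independent of the response given all other features. This effort on random forest inference is made possible by recent developments on high-dimensional random forest consistency. The vanilla version of the FACT test can suffer from bias in the presence of feature dependency, so we exploit the techniques of imbalancing and conditioning for bias correction. We further incorporate the ensemble idea into the FACT statistic through feature transformations for enhanced power. Under a fairly general high-dimensional nonparametric model setting with dependent features, we formally establish that FACT provides theoretically justified random forest feature p-values and enjoys appealing power, through nonasymptotic analyses. The theoretical results and finite-sample advantages of the newly suggested method are illustrated with several simulation examples and an economic forecasting application related to COVID-19.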
Translated by Google Translate
Testing the significance of a variable or group of variables $X$ for predicting a response $Y$, given additional covariates $Z$, is a ubiquitous task in statistics. A simple but common approach is to specify a linear model, and then test whether the regression coefficient for $X$ is non-zero. However, when the model is misspecified, the test may have poor power, for example when $X$ is involved in complex interactions, or lead to many false rejections. In this work we study the problem of testing the model-free null of conditional mean independence, i.e. that the conditional mean of $Y$ given $X$ and $Z$ does not depend on $X$. We propose a simple and general framework that can leverage flexible nonparametric or machine learning methods, such as additive models or random forests, to yield both robust error control and high power. The procedure involves using these methods to perform regressions, first to estimate a form of projection of $Y$ on $X$ and $Z$ using one half of the data, and then to estimate the expected conditional covariance between this projection and $Y$ on the remaining half of the data. While the approach is general, we show that a version of our procedure using spline regression achieves what we show is the minimax optimal rate in this nonparametric testing problem. Numerical experiments demonstrate the effectiveness of our approach both in terms of maintaining Type I error control, and power, compared to several existing approaches.
Classical asymptotic theory for statistical inference usually involves calibrating a statistic by fixing the dimension $d$ while letting the sample size $n$ increase to infinity. Recently, much effort has been dedicated towards understanding how these methods behave in high-dimensional settings, where $d$ and $n$ both increase to infinity together. This often leads to different inference procedures, depending on the assumptions about the dimensionality, leaving the practitioner in a bind: given a dataset with 100 samples in 20 dimensions, should they calibrate by assuming $n \gg d$, or $d/n \approx 0.2$? This paper considers the goal of dimension-agnostic inference; developing methods whose validity does not depend on any assumption on $d$ versus $n$. We introduce an approach that uses variational representations of existing test statistics along with sample splitting and self-normalization to produce a new test statistic with a Gaussian limiting distribution, regardless of how $d$ scales with $n$. The resulting statistic can be viewed as a careful modification of degenerate U-statistics, dropping diagonal blocks and retaining off-diagonal blocks. We exemplify our technique for some classical problems including one-sample mean and covariance testing, and show that our tests have minimax rate-optimal power against appropriate local alternatives. In most settings, our cross U-statistic matches the high-dimensional power of the corresponding (degenerate) U-statistic up to a $\sqrt{2}$ factor.
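The sample-splitting and self-normalization idea described above can be sketched for the one-sample mean problem. This is a minimal illustration, not the authors' implementation: the function name `cross_mean_test`, the even split, and the one-sided rejection rule are assumptions made here for concreteness.

```python
import math
import random
from statistics import NormalDist


def cross_mean_test(samples, alpha=0.05, seed=0):
    """Test H0: E[X] = 0 with a split-sample, self-normalized statistic.

    samples: list of equal-length lists of floats (one list per observation).
    Returns (statistic, reject). All names here are illustrative assumptions.
    """
    data = list(samples)
    random.Random(seed).shuffle(data)
    half = len(data) // 2
    first, second = data[:half], data[half:]
    dim = len(data[0])

    # Direction estimated from the second half only.
    mean_second = [sum(x[j] for x in second) / len(second) for j in range(dim)]

    # Project each first-half observation onto that direction.
    proj = [sum(a * b for a, b in zip(x, mean_second)) for x in first]

    # Studentize the projections: sqrt(n1) * mean / standard deviation.
    n1 = len(proj)
    proj_mean = sum(proj) / n1
    proj_var = sum((v - proj_mean) ** 2 for v in proj) / (n1 - 1)
    statistic = math.sqrt(n1) * proj_mean / math.sqrt(proj_var)

    # Compare with a standard normal quantile (one-sided).
    threshold = NormalDist().inv_cdf(1 - alpha)
    return statistic, statistic > threshold
```

Because the two halves are independent, the projections are conditionally i.i.d. given the second half, which is what makes a Gaussian calibration plausible without fixing how $d$ scales with $n$.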
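The weighted nearest neighbors (WNN) estimator is widely used as a flexible and easy-to-implement nonparametric tool for mean regression estimation. The bagging technique is an elegant way to form WNN estimators with weights automatically assigned to the nearest neighbors; we call the resulting estimator the distributional nearest neighbors (DNN) estimator for easy reference. Yet distributional results for this estimator are lacking, which limits its application to statistical inference. Moreover, when the mean regression function has higher-order smoothness, DNN does not achieve the optimal nonparametric convergence rate, mainly because of bias. In this work, we provide an in-depth technical analysis of DNN, based on which we suggest a bias reduction approach that linearly combines two DNN estimators with different subsampling scales, resulting in the novel two-scale DNN (TDNN) estimator. The two-scale DNN estimator has an equivalent WNN representation with weights that admit explicit forms, some of which are negative. We prove that, thanks to the use of negative weights, the two-scale DNN estimator enjoys the optimal nonparametric rate of convergence in estimating the regression function under a fourth-order smoothness condition. We further go beyond estimation and establish that DNN and two-scale DNN are both asymptotically normal as the subsampling scales and sample size diverge to infinity. For practical implementation, we also provide variance estimators and a distribution estimator for the two-scale DNN using the jackknife and bootstrap techniques. These estimators can be used to construct valid confidence intervals for nonparametric inference on the regression function. The theoretical results and appealing finite-sample performance of the suggested two-scale DNN method are illustrated with several numerical examples.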
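This paper studies the estimation and inference of the average treatment effect (ATE) using deep neural networks (DNNs) in the potential outcomes framework. Under some regularity conditions, the observed response can be formulated as the response of a mean regression problem with both the confounding variables and the treatment indicator as independent variables. Using this formulation, we investigate two methods for ATE estimation and inference based on the mean regression function estimated via DNN regression with a specific network architecture. We show that both DNN estimates of the ATE are consistent, with dimension-free consistency rates, under some assumptions on the underlying true mean regression model. Our model assumptions accommodate a potentially complicated dependence structure of the observed response on the covariates, including latent factors and nonlinear interactions between the treatment indicator and the confounding variables. We also establish the asymptotic normality of our estimators based on the idea of sample splitting, ensuring precise inference and uncertainty quantification. Simulation studies and a real data application justify our theoretical findings and support our DNN estimation and inference methods.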
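The Lasso is a method for high-dimensional regression that is now commonly used when the number of covariates $p$ is of the same order as or larger than the number of observations $n$. Classical asymptotic normality theory does not apply to this model for two fundamental reasons: $(1)$ the regularized risk is non-smooth; $(2)$ the distance between the estimator $\widehat{\boldsymbol{\theta}}$ and the true parameter vector $\boldsymbol{\theta}^*$ cannot be neglected. As a consequence, the standard perturbative arguments that are the traditional basis for asymptotic normality fail. On the other hand, the Lasso estimator can be precisely characterized in the regime in which both $n$ and $p$ are large and $n/p$ is of order one. This characterization was first obtained for Gaussian designs with i.i.d. covariates; here we generalize it to correlated Gaussian designs with non-singular covariance structure. The characterization is expressed in terms of a simpler ``fixed-design'' model. We establish non-asymptotic bounds on the distance between the distributions of various quantities in the two models, which hold uniformly over signals $\boldsymbol{\theta}^*$ in a suitable sparsity class. As an application, we study the distribution of the debiased Lasso and show that a degrees-of-freedom correction is necessary for computing valid confidence intervals.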
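The dominant paradigm in statistical inference relies on a structure of i.i.d. data from a hypothetical infinite population. Despite its success, this framework remains inflexible under complex data structures, even in cases where it is clear what the infinite population represents. In this paper, we explore an alternative framework in which inference relies only on an invariance assumption on the model errors, such as exchangeability or sign symmetry. As a general method for this invariant inference problem, we propose a randomization-based procedure. We prove general conditions for the asymptotic validity of this procedure and illustrate it in many data structures, including clustered errors in one-way and two-way layouts. We find that invariant inference via residual randomization has three appealing properties: (1) it is valid under weak and interpretable conditions, and can address problems with heavy-tailed data, finite clustering, and even some high-dimensional settings; (2) it is robust in finite samples, as it does not rely on the regularity conditions needed for classical asymptotics; (3) it addresses the inference problem in a unified way that adapts to the data structure. Classical procedures such as OLS or the bootstrap, on the other hand, presume the i.i.d. structure and need to be modified whenever the actual problem structure is different. This mismatch in the classical framework has led to a multitude of robust error techniques and bootstrap variants that frequently confound applied research. We confirm these findings through extensive empirical evaluations. Residual randomization performs favorably against many alternatives, including robust error methods, bootstrap variants, and hierarchical models.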
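We propose a goodness-of-fit test for degree-corrected stochastic block models (DCSBM). The test is based on an adjusted chi-square statistic for measuring equality among groups of $n$ multinomial distributions with $d_1, \dots, d_n$ observations. In the context of network models, the number of multinomials ($n$) grows much faster than the number of observations ($d_i$), which corresponds to the degree of node $i$, so the setting deviates from classical asymptotics. We show that the statistic converges in distribution under the null as long as the harmonic mean of $\{d_i\}$ grows to infinity. When applied sequentially, the test can also be used to determine the number of communities. The test operates on a compressed version of the adjacency matrix, conditional on the degrees, and is therefore highly scalable to large sparse networks. We incorporate a novel idea of compressing the rows based on a $(K+1)$-community assignment when testing for $K$ communities. This approach increases power in sequential applications without sacrificing computational efficiency, and we prove its consistency in recovering the number of communities. Since the test statistic does not rely on a specific alternative, its utility goes beyond sequential testing: it can be used to test simultaneously against a wide range of alternatives outside the DCSBM family. In particular, we prove that the test is consistent against a general family of latent-variable network models with community structure.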
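Random forests remain among the most popular off-the-shelf supervised learning algorithms. Despite their well-documented empirical success, until recently few theoretical results were available to describe their performance and behavior. In this work we push beyond recent work on consistency and asymptotic normality by establishing rates of convergence for random forests and other supervised learning ensembles. We develop the notion of generalized U-statistics and show that, within this framework, random forest predictions can remain asymptotically normal for larger subsample sizes than previously established. We also provide Berry-Esseen bounds to quantify the rate at which this convergence occurs, making explicit the roles of the subsample size and the number of trees in determining the distribution of random forest predictions.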
Many scientific and engineering challenges, ranging from personalized medicine to customized marketing recommendations, require an understanding of treatment effect heterogeneity. In this paper, we develop a non-parametric causal forest for estimating heterogeneous treatment effects that extends Breiman's widely used random forest algorithm. In the potential outcomes framework with unconfoundedness, we show that causal forests are pointwise consistent for the true treatment effect, and have an asymptotically Gaussian and centered sampling distribution. We also discuss a practical method for constructing asymptotic confidence intervals for the true treatment effect that are centered at the causal forest estimates. Our theoretical results rely on a generic Gaussian theory for a large family of random forest algorithms. To our knowledge, this is the first set of results that allows any type of random forest, including classification and regression forests, to be used for provably valid statistical inference. In experiments, we find causal forests to be substantially more powerful than classical methods based on nearest-neighbor matching, especially in the presence of irrelevant covariates.
In nonparametric independence testing, we observe i.i.d.\ data $\{(X_i,Y_i)\}_{i=1}^n$, where $X \in \mathcal{X}, Y \in \mathcal{Y}$ lie in any general spaces, and we wish to test the null that $X$ is independent of $Y$. Modern test statistics such as the kernel Hilbert-Schmidt Independence Criterion (HSIC) and Distance Covariance (dCov) have intractable null distributions due to the degeneracy of the underlying U-statistics. Thus, in practice, one often resorts to using permutation testing, which provides a nonasymptotic guarantee at the expense of recalculating the quadratic-time statistics (say) a few hundred times. This paper provides a simple but nontrivial modification of HSIC and dCov (called xHSIC and xdCov, pronounced ``cross'' HSIC/dCov) so that they have a limiting Gaussian distribution under the null, and thus do not require permutations. This requires building on the newly developed theory of cross U-statistics by Kim and Ramdas (2020), and in particular developing several nontrivial extensions of the theory in Shekhar et al. (2022), which developed an analogous permutation-free kernel two-sample test. We show that our new tests, like the originals, are consistent against fixed alternatives, and minimax rate optimal against smooth local alternatives. Numerical simulations demonstrate that compared to the full dCov or HSIC, our variants have the same power up to a $\sqrt 2$ factor, giving practitioners a new option for large problems or data-analysis pipelines where computation, not sample size, could be the bottleneck.
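Randomized singular value decomposition (RSVD) is a class of computationally efficient algorithms for computing the truncated SVD of large data matrices. Given an $n \times n$ symmetric matrix $\mathbf{M}$, the prototypical RSVD algorithm outputs an approximation of the $k$ leading singular vectors of $\mathbf{M}$ by computing the SVD of $\mathbf{M}^{g} \mathbf{G}$; here $g \geq 1$ is an integer and $\mathbf{G} \in \mathbb{R}^{n \times k}$ is a random Gaussian sketching matrix. In this paper we study the statistical properties of RSVD under a general "signal-plus-noise" framework, i.e., the observed matrix $\hat{\mathbf{M}}$ is assumed to be an additive perturbation of some true but unknown signal matrix $\mathbf{M}$. We first derive upper bounds for the $\ell_2$ (spectral norm) and $\ell_{2 \to \infty}$ (maximum row-wise $\ell_2$ norm) distances between the approximate singular vectors of $\hat{\mathbf{M}}$ and the true singular vectors of the signal matrix $\mathbf{M}$. These upper bounds depend on the signal-to-noise ratio (SNR) and the number of power iterations $g$. A phase transition phenomenon is observed in which a smaller SNR requires larger values of $g$ to guarantee convergence of the $\ell_2$ and $\ell_{2 \to \infty}$ distances. We also show that the thresholds for $g$ at which these phase transitions occur are sharp whenever the noise matrices satisfy a certain trace growth condition. Finally, we derive normal approximations for the row-wise fluctuations of the approximate singular vectors and the entrywise fluctuations of the approximate matrix. We illustrate our theoretical results by deriving nearly optimal performance guarantees for RSVD when applied to three statistical inference problems, namely community detection, matrix completion, and principal component analysis with missing data.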
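The prototypical algorithm described above (sketch with a Gaussian matrix, apply $g$ power iterations, then take an SVD) can be written in a few lines. This is a minimal sketch assuming NumPy is available; the function name `rsvd` and its signature are illustrative choices, not taken from the paper.

```python
import numpy as np


def rsvd(M, k, g=1, seed=0):
    """Approximate the k leading singular vectors of a symmetric matrix M.

    Forms Y = M^g G for a random Gaussian sketching matrix G (n x k) and
    returns the left singular vectors of Y. Names are illustrative only.
    """
    rng = np.random.default_rng(seed)
    n = M.shape[1]
    G = rng.standard_normal((n, k))  # Gaussian sketching matrix

    # Apply g power iterations: after the loop, Y equals M^g G.
    Y = G
    for _ in range(g):
        Y = M @ Y

    # Left singular vectors of the n x k sketch span the approximate
    # leading singular subspace of M.
    U, _, _ = np.linalg.svd(Y, full_matrices=False)
    return U
```

For large $g$, practical implementations usually re-orthonormalize `Y` between multiplications to avoid numerical loss of rank; that step is omitted here to keep the correspondence with $\mathbf{M}^{g} \mathbf{G}$ explicit.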
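Nested simulation concerns estimating functionals of a conditional expectation via simulation. In this paper, we propose a new method based on kernel ridge regression that exploits the smoothness of the conditional expectation as a function of the multidimensional conditioning variable. Asymptotic analysis shows that, as the simulation budget increases, the proposed method can effectively alleviate the curse of dimensionality on the convergence rate, provided that the conditional expectation is sufficiently smooth. The smoothness bridges the gap between the cubic root convergence rate (the optimal rate for standard nested simulation) and the square root convergence rate (the canonical rate for standard Monte Carlo simulation). We demonstrate the performance of the proposed method with numerical examples from portfolio risk management and input uncertainty quantification.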
In a high-dimensional linear predictive regression where the number of potential predictors can be larger than the sample size, we consider using LASSO, a popular L1-penalized regression method, to estimate the sparse coefficients when many unit root regressors are present. Consistency of LASSO relies on two building blocks: the deviation bound of the cross product of the regressors and the error term, and the restricted eigenvalue of the Gram matrix of the regressors. In our setting where unit root regressors are driven by temporally dependent non-Gaussian innovations, we establish original probabilistic bounds for these two building blocks. The bounds imply that the rates of convergence of LASSO are different from those in the familiar cross-sectional case. In practical applications given a mixture of stationary and nonstationary predictors, the asymptotic guarantee of LASSO is preserved if all predictors are scale-standardized. In an empirical example of forecasting the unemployment rate with many macroeconomic time series, strong performance is delivered by LASSO when the initial specification is guided by macroeconomic domain expertise.
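Approximate message passing (AMP) has emerged as an effective iterative paradigm for solving high-dimensional statistical problems. However, prior AMP theory, which focused mostly on high-dimensional asymptotics, falls short of predicting the AMP dynamics when the number of iterations surpasses $o\big(\frac{\log n}{\log \log n}\big)$ (with $n$ the problem dimension). To address this inadequacy, this paper develops a non-asymptotic framework for understanding AMP in spiked matrix estimation. Building on a new decomposition of the AMP updates and controllable residual terms, we lay out an analysis recipe to characterize the finite-sample behavior of AMP in the presence of an independent initialization, which is further generalized to allow for spectral initialization. As two concrete consequences of the proposed analysis recipe: (i) when solving $\mathbb{Z}_2$ synchronization, we predict the behavior of spectrally initialized AMP for up to $O\big(\frac{n}{\mathrm{poly} \log n}\big)$ iterations, showing that the algorithm succeeds without the need for a subsequent refinement stage (as conjectured recently by \citet{celentano2021local}); (ii) we characterize the non-asymptotic behavior of AMP in sparse PCA (in the spiked Wigner model) for a broad range of signal-to-noise ratios.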
The kernel Maximum Mean Discrepancy (MMD) is a popular multivariate distance metric between distributions that has found utility in two-sample testing. The usual kernel-MMD test statistic is a degenerate U-statistic under the null, and thus it has an intractable limiting distribution. Hence, to design a level-$\alpha$ test, one usually selects the rejection threshold as the $(1-\alpha)$-quantile of the permutation distribution. The resulting nonparametric test has finite-sample validity but suffers from large computational cost, since every permutation takes quadratic time. We propose the cross-MMD, a new quadratic-time MMD test statistic based on sample-splitting and studentization. We prove that under mild assumptions, the cross-MMD has a limiting standard Gaussian distribution under the null. Importantly, we also show that the resulting test is consistent against any fixed alternative, and when using the Gaussian kernel, it has minimax rate-optimal power against local alternatives. For large sample sizes, our new cross-MMD provides a significant speedup over the MMD, for only a slight loss in power.
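Stability selection (Meinshausen and Buhlmann, 2010) makes any feature selection method more stable by returning only those features that are consistently selected across many subsamples. We prove (in what is, to our knowledge, the first result of its kind) that for data containing highly correlated proxies for an important latent variable, the lasso typically selects one proxy, yet stability selection with the lasso can fail to select any proxy, leading to worse predictive performance than the lasso alone. We introduce cluster stability selection, which exploits the practitioner's knowledge that highly correlated clusters exist in the data, resulting in better feature rankings than stability selection in this setting. We consider several feature-combination approaches, including taking a weighted average of the features in each important cluster, where the weights are determined by the frequency with which cluster members are selected, which we show leads to better predictive models than previous proposals. We present generalizations of theoretical guarantees from Meinshausen and Buhlmann (2010) and Shah and Samworth (2012) to show that cluster stability selection retains the same guarantees. In summary, cluster stability selection enjoys the best of both worlds, yielding a sparse selected set that is both stable and has good predictive performance.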
We develop a general framework for distribution-free predictive inference in regression, using conformal inference. The proposed methodology allows for the construction of a prediction band for the response variable using any estimator of the regression function. The resulting prediction band preserves the consistency properties of the original estimator under standard assumptions, while guaranteeing finite-sample marginal coverage even when these assumptions do not hold. We analyze and compare, both empirically and theoretically, the two major variants of our conformal framework: full conformal inference and split conformal inference, along with a related jackknife method. These methods offer different tradeoffs between statistical accuracy (length of resulting prediction intervals) and computational efficiency. As extensions, we develop a method for constructing valid in-sample prediction intervals called rank-one-out conformal inference, which has essentially the same computational efficiency as split conformal inference. We also describe an extension of our procedures for producing prediction bands with locally varying length, in order to adapt to heteroskedasticity in the data. Finally, we propose a model-free notion of variable importance, called leave-one-covariate-out or LOCO inference. Accompanying this paper is an R package conformalInference that implements all of the proposals we have introduced. In the spirit of reproducibility, all of our empirical results can also be easily (re)generated using this package.
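Split conformal inference, one of the two variants mentioned above, is simple enough to sketch directly. The example below is a hedged illustration in Python and is not the conformalInference R package: the helper names `fit_line` and `split_conformal` are hypothetical, and a one-dimensional least-squares line stands in for "any estimator of the regression function".

```python
import math
import random


def fit_line(xs, ys):
    """Least-squares line in one dimension; a stand-in for any regression estimator."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x


def split_conformal(xs, ys, alpha=0.1, fit=fit_line, seed=0):
    """Return a function x -> (lower, upper) giving a split conformal band."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    half = len(idx) // 2
    train, calib = idx[:half], idx[half:]

    # Fit the regression estimator on the training half only.
    predict = fit([xs[i] for i in train], [ys[i] for i in train])

    # Absolute residuals on the held-out calibration half.
    residuals = sorted(abs(ys[i] - predict(xs[i])) for i in calib)

    # Band half-width: the k-th smallest residual, k = ceil((m + 1)(1 - alpha)).
    k = math.ceil((len(calib) + 1) * (1 - alpha))
    width = residuals[k - 1] if k <= len(calib) else math.inf

    return lambda x: (predict(x) - width, predict(x) + width)
```

The band has constant width around the fitted function; the locally varying extension mentioned in the abstract would rescale the residuals by an estimate of their spread, which is not shown here.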
We consider the problem of estimating a multivariate function $f_0$ of bounded variation (BV), from noisy observations $y_i = f_0(x_i) + z_i$ made at random design points $x_i \in \mathbb{R}^d$, $i=1,\ldots,n$. We study an estimator that forms the Voronoi diagram of the design points, and then solves an optimization problem that regularizes according to a certain discrete notion of total variation (TV): the sum of weighted absolute differences of parameters $\theta_i,\theta_j$ (which estimate the function values $f_0(x_i),f_0(x_j)$) at all neighboring cells $i,j$ in the Voronoi diagram. This is seen to be equivalent to a variational optimization problem that regularizes according to the usual continuum (measure-theoretic) notion of TV, once we restrict the domain to functions that are piecewise constant over the Voronoi diagram. The regression estimator under consideration hence performs (shrunken) local averaging over adaptively formed unions of Voronoi cells, and we refer to it as the Voronoigram, following the ideas in Koenker (2005), and drawing inspiration from Tukey's regressogram (Tukey, 1961). Our contributions in this paper span both the conceptual and theoretical frontiers: we discuss some of the unique properties of the Voronoigram in comparison to TV-regularized estimators that use other graph-based discretizations; we derive the asymptotic limit of the Voronoi TV functional; and we prove that the Voronoigram is minimax rate optimal (up to log factors) for estimating BV functions that are essentially bounded.
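A significant obstacle in the development of robust machine learning models is covariate shift, a form of distribution shift that occurs when the input distributions of the training and test sets differ while the conditional label distributions remain the same. Despite the prevalence of covariate shift in real-world applications, a theoretical understanding in the context of modern machine learning has remained lacking. In this work, we examine the exact high-dimensional asymptotics of random feature regression under covariate shift and present a precise characterization of the limiting test error, bias, and variance in this setting. Our results motivate a natural partial order over covariate shifts that provides a sufficient condition for determining when the shift will harm (or even help) test performance. We find that overparameterized models exhibit enhanced robustness to covariate shift, providing one of the first theoretical explanations for this intriguing phenomenon. Additionally, our analysis reveals an exact linear relationship between in-distribution and out-of-distribution generalization performance, offering an explanation for this surprising recent empirical observation.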