ProtoPNet and its follow-up variants (ProtoPNets) have attracted broad research interest for their intrinsic interpretability from prototypes and for accuracy comparable to that of non-interpretable counterparts. However, it has recently been found that the interpretability of prototypes can be corrupted by the semantic gap between similarity in latent space and similarity in input space. In this work, we make the first attempt to quantitatively evaluate the interpretability of prototype-based explanations, rather than relying solely on qualitative evaluation through visualization examples, which can easily be skewed by cherry-picking. To this end, we propose two evaluation metrics, termed consistency score and stability score, to evaluate explanation consistency across images and explanation robustness against perturbations, both of which are essential for explanations to be put into practice. Furthermore, we propose a shallow-deep feature alignment (SDFA) module and a score aggregation (SA) module to improve the interpretability of prototypes. We conduct systematic evaluation experiments and substantial discussions to uncover the interpretability of existing ProtoPNets. Experiments demonstrate that our method significantly outperforms the state of the art in both accuracy and interpretability, under both the conventional qualitative evaluations and the proposed quantitative evaluations. Code is available at https://github.com/hqhQAQ/EvalProtoPNet.
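Neural networks (NNs) and decision trees (DTs) are both popular machine learning models, yet they have mutually exclusive advantages and limitations. To bring the best of both worlds, a variety of approaches have been proposed to integrate NNs and DTs explicitly or implicitly. In this survey, these approaches are organized into a school that we term neural trees (NTs). The survey aims to present a comprehensive review of NTs and attempts to identify how they enhance model interpretability. We first propose a thorough taxonomy of NTs that expresses the gradual integration and co-evolution of NNs and DTs. We then analyze NTs in terms of their interpretability and performance, and suggest possible solutions to the remaining challenges. Finally, the survey concludes with a discussion of conditional computation and promising directions for the field. A list of the papers reviewed in this survey, along with their corresponding code, is available at: https://github.com/zju-vipa/awesome-neural-trees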
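The prototypical part network (ProtoPNet) has drawn wide attention and prompted many follow-up studies because of its self-explanatory property for explainable artificial intelligence (XAI). However, when ProtoPNet is applied directly to vision transformer (ViT) backbones, the learned prototypes suffer from a "distraction" problem: they have a relatively high probability of being activated by the background and pay less attention to the foreground. The powerful capability of modeling long-range dependencies makes it hard for a transformer-based ProtoPNet to focus on prototypical parts, which severely impairs its inherent interpretability. This paper proposes the prototypical part transformer (ProtoPFormer) to apply the prototype-based method to ViTs appropriately and effectively for interpretable image recognition. The proposed method introduces global and local prototypes that capture and highlight the representative holistic and partial features of targets, in line with the architectural characteristics of ViTs. The global prototypes provide a global view of the object and guide the local prototypes to concentrate on the foreground while eliminating the influence of the background. The local prototypes are then explicitly supervised to concentrate on their respective prototypical visual parts, which increases overall interpretability. Extensive experiments demonstrate that the proposed global and local prototypes can correct each other and jointly make the final decisions, reasoning about the decision-making process faithfully and transparently from the holistic and local perspectives, respectively. Moreover, ProtoPFormer consistently achieves better performance and visualization results than the state-of-the-art (SOTA) prototype-based baselines. Our code has been released at https://github.com/zju-vipa/protopformer.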
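Recently, vision Transformers (ViTs) have been developing rapidly and have started to challenge the dominance of convolutional neural networks (CNNs) in computer vision (CV). By replacing the hard-coded inductive biases of convolution with a general-purpose Transformer architecture, ViTs have surpassed CNNs, especially when data is sufficient. However, ViTs are prone to over-fitting on small datasets and therefore rely on large-scale pre-training, which takes an enormous amount of time. In this paper, we strive to liberate ViTs from pre-training by introducing CNNs' inductive biases back into ViTs while preserving their network architecture for a higher upper bound, and by setting up more suitable optimization objectives. First, an agent CNN with inductive biases is designed based on the given ViT. Then a bootstrapping training algorithm is proposed to jointly optimize the agent and the ViT with weight sharing, during which the ViT learns inductive biases from the intermediate features of the agent. Extensive experiments on CIFAR-10/100 and ImageNet-1k with limited training data show encouraging results: the inductive biases help ViTs converge faster, even with fewer parameters.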
Few-Shot Instance Segmentation (FSIS) requires models to detect and segment novel classes given only a few support examples. In this work, we explore a simple yet unified solution for FSIS and its incremental variants, and introduce a new framework named Reference Twice (RefT) that fully exploits the relationship between support and query features within a Transformer-like framework. Our key insights are twofold. First, with the aid of support masks, we can generate dynamic class centers more appropriately to re-weight query features. Second, we find that support object queries have already encoded key factors after base training. The query features can therefore be enhanced twice, at the feature level and at the instance level. In particular, we first design a mask-based dynamic weighting module to enhance support features and then propose linking object queries for better calibration via cross-attention. With these steps, performance on novel classes improves significantly over our strong baseline. Additionally, our framework can be easily extended to incremental FSIS with minor modifications. When benchmarked on the COCO dataset under the FSIS, gFSIS, and iFSIS settings, our method achieves competitive performance compared with existing approaches across different shots, e.g., we boost nAP by a noticeable +8.2/+9.4 over the current state-of-the-art FSIS method for 10/30-shot. We further demonstrate the superiority of our approach on Few-Shot Object Detection. Code and model will be available.
This paper focuses on designing efficient models with few parameters and low FLOPs for dense prediction. Even though CNN-based lightweight methods have achieved impressive results after years of research, the trade-off between model accuracy and constrained resources still needs further improvement. This work rethinks the essential unity of the efficient Inverted Residual Block in MobileNetv2 and the effective Transformer in ViT, inductively abstracting a general concept of the Meta-Mobile Block, and we argue that the specific instantiation is very important to model performance even when the framework is shared. Motivated by this phenomenon, we deduce a simple yet efficient modern Inverted Residual Mobile Block (iRMB) for mobile applications, which absorbs CNN-like efficiency to model short-distance dependencies and Transformer-like dynamic modeling capability to learn long-distance interactions. Furthermore, we design a ResNet-like 4-phase Efficient MOdel (EMO) based only on a series of iRMBs for dense applications. Extensive experiments on the ImageNet-1K, COCO2017, and ADE20K benchmarks demonstrate the superiority of our EMO over state-of-the-art methods, e.g., our EMO-1M/2M/5M achieve 71.5, 75.1, and 78.4 Top-1, surpassing SoTA CNN-/Transformer-based models while balancing model accuracy and efficiency well.
Despite significant progress in object categorization in recent years, a number of important challenges remain, chiefly the ability to learn from limited labeled data and to recognize object classes within a large, potentially open, set of labels. Zero-shot learning is one way of addressing these challenges, but it has only been shown to work with limited-size class vocabularies and typically requires a separation between supervised and unsupervised classes, allowing the former to inform the latter but not vice versa. We propose the notion of vocabulary-informed learning to alleviate these challenges and to address supervised, zero-shot, generalized zero-shot, and open set recognition within a unified framework. Specifically, we propose a weighted maximum margin framework for semantic manifold-based recognition that incorporates distance constraints from (both supervised and unsupervised) vocabulary atoms. The distance constraints ensure that labeled samples are projected closer to their correct prototypes in the embedding space than to other prototypes. We show that the resulting model improves supervised, zero-shot, generalized zero-shot, and large open set recognition, with a vocabulary of up to 310K classes, on the Animals with Attributes and ImageNet datasets.
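The distance constraint described in this abstract can be illustrated with a small hinge-style penalty. This is only a sketch under stated assumptions: the abstract does not give the loss, so the squared Euclidean distance, the unit margin, the unweighted sum, and the names `sq_dist` and `margin_violation` are all hypothetical choices made for illustration, not the paper's formulation.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def margin_violation(embedding, prototypes, label, margin=1.0):
    """Sum of hinge penalties for one labeled sample (illustrative only).

    For every prototype other than the correct one, the sample should be
    closer to its own prototype by at least `margin`; each violated
    constraint contributes its slack. The paper's weighting scheme is
    not stated in the abstract, so every constraint has weight 1 here.
    """
    d_correct = sq_dist(embedding, prototypes[label])
    return sum(
        max(0.0, margin + d_correct - sq_dist(embedding, proto))
        for name, proto in prototypes.items()
        if name != label
    )


# Hypothetical 2-D prototypes for three vocabulary atoms.
prototypes = {"cat": [0.0, 0.0], "dog": [3.0, 0.0], "truck": [0.0, 4.0]}
well_placed = margin_violation([0.5, 0.0], prototypes, "cat")
too_close_to_dog = margin_violation([1.4, 0.0], prototypes, "cat")
```

A sample that satisfies every constraint incurs no penalty, while one that drifts toward a wrong prototype is penalized in proportion to how far the margin is violated.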
Deploying reliable deep learning techniques in interdisciplinary applications requires learned models to output accurate and (even more importantly) explainable predictions. Existing approaches typically explain network outputs in a post-hoc fashion, under the implicit assumption that faithful explanations come from accurate predictions/classifications. We make the opposite claim: explanations boost (or even determine) classification. That is, end-to-end learning of explanation factors to augment discriminative representation extraction could be a more intuitive strategy for assuring fine-grained explainability, e.g., in neuroimaging and neuroscience studies with high-dimensional data containing noisy, redundant, and task-irrelevant information. In this paper, we propose such an explainable geometric deep network, dubbed NeuroExplainer, and apply it to uncover altered infant cortical development patterns associated with preterm birth. Given fundamental cortical attributes as network input, NeuroExplainer adopts a hierarchical attention-decoding framework to learn fine-grained attentions and the corresponding discriminative representations, in order to accurately distinguish preterm infants from term-born infants at term-equivalent age. NeuroExplainer learns the hierarchical attention-decoding modules under subject-level weak supervision coupled with targeted regularizers derived from domain knowledge of brain development. These prior-guided constraints implicitly maximize the explainability metrics (i.e., fidelity, sparsity, and stability) during network training, driving the learned network to output detailed explanations and accurate classifications. Experimental results on the public dHCP benchmark suggest that NeuroExplainer yields quantitatively reliable explanations that are qualitatively consistent with representative neuroimaging studies.
Improving the visual quality of a degraded observation by correcting its exposure level is a fundamental task in computer vision. Existing works commonly lack adaptability to unknown scenes, because of their data-driven patterns (deep networks) or limited regularization (traditional optimization), and they usually need time-consuming inference. These two points heavily limit their practicability. In this paper, we establish a Practical Exposure Corrector (PEC) that combines efficiency and performance. Concretely, we rethink exposure correction to provide a linear solution with exposure-sensitive compensation. To generate the compensation, we introduce an exposure adversarial function as the key engine for fully extracting valuable information from the observation. By applying this function, we construct a segmented shrinkage iterative scheme that generates the desired compensation. Its shrinkage nature provides strong support for algorithmic stability and robustness. Extensive experimental evaluations demonstrate the superiority of the proposed PEC. The code is available at https://rsliu.tech/PEC.
Sparse principal component analysis (SPCA) has been widely used for dimensionality reduction and feature extraction in high-dimensional data analysis. Despite many methodological and theoretical developments over the past two decades, the theoretical guarantees of the popular SPCA algorithm proposed by Zou, Hastie & Tibshirani (2006), which is based on the elastic net, are still unknown. We aim to close this important theoretical gap in this paper. We first revisit the SPCA algorithm of Zou et al. (2006) and present our implementation. We also study a computationally more efficient variant of the SPCA algorithm of Zou et al. (2006) that can be considered the limiting case of SPCA. We provide guarantees of convergence to a stationary point for both algorithms. We prove that, under a sparse spiked covariance model and mild regularity conditions, both algorithms can recover the principal subspace consistently. We show that their estimation error bounds match the best available bounds in existing works, or the minimax rates, up to logarithmic factors. Moreover, we demonstrate the numerical performance of both algorithms in simulation studies.
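To make the alternating structure of this kind of algorithm concrete, the following NumPy sketch implements a soft-thresholding variant in the spirit of the limiting case mentioned above: the elastic-net step is replaced by soft-thresholding of X^T X A (the form usually given for the large-ridge-penalty limit), followed by an orthogonal update of A from an SVD. It is an illustration under assumptions, not the authors' implementation: the function names, the shared penalty `lam1` for all components, the stopping rule, and the centering step are choices made here.

```python
import numpy as np


def soft_threshold(z, t):
    """Elementwise soft-thresholding: shrink z toward zero by t."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)


def spca_limiting(X, k, lam1, n_iter=200, tol=1e-8):
    """Alternating sketch of sparse PCA with a soft-thresholding B-step.

    X    : (n, p) data matrix (centered internally)
    k    : number of sparse components
    lam1 : L1 penalty, shared by all components (an assumption)
    Returns a (p, k) matrix of unit-norm sparse loadings; a column that
    is thresholded entirely to zero is returned as zeros.
    """
    X = X - X.mean(axis=0)
    gram = X.T @ X
    # Initialize A with the ordinary PCA loadings (top-k right singular vectors).
    _, _, vt = np.linalg.svd(X, full_matrices=False)
    A = vt[:k].T
    B = np.zeros_like(A)
    for _ in range(n_iter):
        # B-step: sparse loadings given A.
        B_new = soft_threshold(gram @ A, lam1 / 2.0)
        # A-step: orthonormal A given B, from the SVD of X^T X B.
        U, _, Wt = np.linalg.svd(gram @ B_new, full_matrices=False)
        A = U @ Wt
        converged = np.linalg.norm(B_new - B) < tol
        B = B_new
        if converged:
            break
    norms = np.linalg.norm(B, axis=0)
    norms[norms == 0.0] = 1.0  # avoid division by zero for empty columns
    return B / norms


# Toy usage on synthetic data with one sparse spike on the first 3 coordinates.
rng = np.random.default_rng(0)
v = np.zeros(10)
v[:3] = 1.0 / np.sqrt(3.0)
X_demo = rng.standard_normal((200, 10)) + 5.0 * rng.standard_normal((200, 1)) * v
loadings = spca_limiting(X_demo, k=1, lam1=1000.0)
```

The penalty `lam1` is on the scale of the entries of X^T X A, so it has to grow with the sample size and signal strength; in practice it would be tuned rather than fixed as in this toy call.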