我们讨论了具有未知IV有效性的线性仪器变量(IV)模型中识别的基本问题。我们重新审视了流行的多数和多元化规则,并表明通常没有识别条件是“且仅在总体上”。假设“最稀少的规则”,该规则等同于多数规则,但在计算算法中变得运作,我们研究并证明了基于两步选择的其他IV估计器的非convex惩罚方法的优势,就两步选择而言选择一致性和单独弱IV的适应性。此外,我们提出了一种与识别条件保持一致的替代较低的惩罚,并同时提供甲骨文稀疏结构。与先前的文献相比,针对静脉强度较弱的估计仪得出了理想的理论特性。使用模拟证明了有限样本特性,并且选择和估计方法应用于有关贸易对经济增长的影响的经验研究。
translated by 谷歌翻译
数据稀疏问题是自然语言理解(NLU)的关键挑战,特别是对于新的目标域。通过在源域中训练NLU模型并直接将模型应用于任意目标域(即使没有微调),很少拍摄的NLU对缓解数据稀缺问题至关重要。在本文中,我们建议改进具有矢量投影距离和抽象三角条件随机场(CRF)的原型网络,用于几次射击NLU。向量投影距离利用在标签向量上的上下文词嵌入的投影作为单词标签相似度,其等同于归一化的线性模型。抽象三角CRF了解用于联合意图分类和插槽填充任务的域名忽视标签转换。广泛的实验表明,我们所提出的方法可以显着超越强力基线。具体而言,我们的方法可以在中文和英语中达到两次拍摄的两次拍摄NLU基准(几个关节和剪辑)的新技术,而无需对目标域的微调。
translated by 谷歌翻译
细粒度的视觉分类(FGVC)旨在识别子类别的对象。由于级细的级别差异,这是一个非常具有挑战性的任务。现有的研究将大型卷积神经网络或视觉变压器应用为特征提取器,这是极其计算昂贵的。实际上,实际的细粒度识别的场景通常需要更轻薄的移动网络可以离线使用。然而,基本移动网络特征提取能力比大规模模型弱。本文基于轻质MobileNetv2,我们提出了一种具有递归马赛克发生器(RMG-PMSI)的逐步多级交互训练方法。首先,我们提出了一种递归马赛克发生器(RMG),其产生不同阶段的不同粒度的图像。然后,不同阶段的特征通过多级相互作用(MSI)模块,其增强和补充不同阶段的相应特征。最后,使用渐进式训练(P),可以充分利用不同阶段中的模型提取的特征并彼此融合。三个着名的细粒度基准测试的实验表明,RMG-PMSI可以显着提高性能,具有良好的稳健性和可转移性。
translated by 谷歌翻译
In this paper, we propose a robust 3D detector, named Cross Modal Transformer (CMT), for end-to-end 3D multi-modal detection. Without explicit view transformation, CMT takes the image and point clouds tokens as inputs and directly outputs accurate 3D bounding boxes. The spatial alignment of multi-modal tokens is performed implicitly, by encoding the 3D points into multi-modal features. The core design of CMT is quite simple while its performance is impressive. CMT obtains 73.0% NDS on nuScenes benchmark. Moreover, CMT has a strong robustness even if the LiDAR is missing. Code will be released at https://github.com/junjie18/CMT.
translated by 谷歌翻译
Different people speak with diverse personalized speaking styles. Although existing one-shot talking head methods have made significant progress in lip sync, natural facial expressions, and stable head motions, they still cannot generate diverse speaking styles in the final talking head videos. To tackle this problem, we propose a one-shot style-controllable talking face generation framework. In a nutshell, we aim to attain a speaking style from an arbitrary reference speaking video and then drive the one-shot portrait to speak with the reference speaking style and another piece of audio. Specifically, we first develop a style encoder to extract dynamic facial motion patterns of a style reference video and then encode them into a style code. Afterward, we introduce a style-controllable decoder to synthesize stylized facial animations from the speech content and style code. In order to integrate the reference speaking style into generated videos, we design a style-aware adaptive transformer, which enables the encoded style code to adjust the weights of the feed-forward layers accordingly. Thanks to the style-aware adaptation mechanism, the reference speaking style can be better embedded into synthesized videos during decoding. Extensive experiments demonstrate that our method is capable of generating talking head videos with diverse speaking styles from only one portrait image and an audio clip while achieving authentic visual effects. Project Page: https://github.com/FuxiVirtualHuman/styletalk.
translated by 谷歌翻译
The visual dimension of cities has been a fundamental subject in urban studies, since the pioneering work of scholars such as Sitte, Lynch, Arnheim, and Jacobs. Several decades later, big data and artificial intelligence (AI) are revolutionizing how people move, sense, and interact with cities. This paper reviews the literature on the appearance and function of cities to illustrate how visual information has been used to understand them. A conceptual framework, Urban Visual Intelligence, is introduced to systematically elaborate on how new image data sources and AI techniques are reshaping the way researchers perceive and measure cities, enabling the study of the physical environment and its interactions with socioeconomic environments at various scales. The paper argues that these new approaches enable researchers to revisit the classic urban theories and themes, and potentially help cities create environments that are more in line with human behaviors and aspirations in the digital age.
translated by 谷歌翻译
Deploying reliable deep learning techniques in interdisciplinary applications needs learned models to output accurate and ({even more importantly}) explainable predictions. Existing approaches typically explicate network outputs in a post-hoc fashion, under an implicit assumption that faithful explanations come from accurate predictions/classifications. We have an opposite claim that explanations boost (or even determine) classification. That is, end-to-end learning of explanation factors to augment discriminative representation extraction could be a more intuitive strategy to inversely assure fine-grained explainability, e.g., in those neuroimaging and neuroscience studies with high-dimensional data containing noisy, redundant, and task-irrelevant information. In this paper, we propose such an explainable geometric deep network dubbed as NeuroExplainer, with applications to uncover altered infant cortical development patterns associated with preterm birth. Given fundamental cortical attributes as network input, our NeuroExplainer adopts a hierarchical attention-decoding framework to learn fine-grained attentions and respective discriminative representations to accurately recognize preterm infants from term-born infants at term-equivalent age. NeuroExplainer learns the hierarchical attention-decoding modules under subject-level weak supervision coupled with targeted regularizers deduced from domain knowledge regarding brain development. These prior-guided constraints implicitly maximizes the explainability metrics (i.e., fidelity, sparsity, and stability) in network training, driving the learned network to output detailed explanations and accurate classifications. Experimental results on the public dHCP benchmark suggest that NeuroExplainer led to quantitatively reliable explanation results that are qualitatively consistent with representative neuroimaging studies.
translated by 谷歌翻译
Domain adaptive detection aims to improve the generalization of detectors on target domain. To reduce discrepancy in feature distributions between two domains, recent approaches achieve domain adaption through feature alignment in different granularities via adversarial learning. However, they neglect the relationship between multiple granularities and different features in alignment, degrading detection. Addressing this, we introduce a unified multi-granularity alignment (MGA)-based detection framework for domain-invariant feature learning. The key is to encode the dependencies across different granularities including pixel-, instance-, and category-levels simultaneously to align two domains. Specifically, based on pixel-level features, we first develop an omni-scale gated fusion (OSGF) module to aggregate discriminative representations of instances with scale-aware convolutions, leading to robust multi-scale detection. Besides, we introduce multi-granularity discriminators to identify where, either source or target domains, different granularities of samples come from. Note that, MGA not only leverages instance discriminability in different categories but also exploits category consistency between two domains for detection. Furthermore, we present an adaptive exponential moving average (AEMA) strategy that explores model assessments for model update to improve pseudo labels and alleviate local misalignment problem, boosting detection robustness. Extensive experiments on multiple domain adaption scenarios validate the superiority of MGA over other approaches on FCOS and Faster R-CNN detectors. Code will be released at https://github.com/tiankongzhang/MGA.
translated by 谷歌翻译
In the scenario of black-box adversarial attack, the target model's parameters are unknown, and the attacker aims to find a successful adversarial perturbation based on query feedback under a query budget. Due to the limited feedback information, existing query-based black-box attack methods often require many queries for attacking each benign example. To reduce query cost, we propose to utilize the feedback information across historical attacks, dubbed example-level adversarial transferability. Specifically, by treating the attack on each benign example as one task, we develop a meta-learning framework by training a meta-generator to produce perturbations conditioned on benign examples. When attacking a new benign example, the meta generator can be quickly fine-tuned based on the feedback information of the new task as well as a few historical attacks to produce effective perturbations. Moreover, since the meta-train procedure consumes many queries to learn a generalizable generator, we utilize model-level adversarial transferability to train the meta-generator on a white-box surrogate model, then transfer it to help the attack against the target model. The proposed framework with the two types of adversarial transferability can be naturally combined with any off-the-shelf query-based attack methods to boost their performance, which is verified by extensive experiments.
translated by 谷歌翻译
Learning efficient and interpretable policies has been a challenging task in reinforcement learning (RL), particularly in the visual RL setting with complex scenes. While neural networks have achieved competitive performance, the resulting policies are often over-parameterized black boxes that are difficult to interpret and deploy efficiently. More recent symbolic RL frameworks have shown that high-level domain-specific programming logic can be designed to handle both policy learning and symbolic planning. However, these approaches rely on coded primitives with little feature learning, and when applied to high-dimensional visual scenes, they can suffer from scalability issues and perform poorly when images have complex object interactions. To address these challenges, we propose \textit{Differentiable Symbolic Expression Search} (DiffSES), a novel symbolic learning approach that discovers discrete symbolic policies using partially differentiable optimization. By using object-level abstractions instead of raw pixel-level inputs, DiffSES is able to leverage the simplicity and scalability advantages of symbolic expressions, while also incorporating the strengths of neural networks for feature learning and optimization. Our experiments demonstrate that DiffSES is able to generate symbolic policies that are simpler and more and scalable than state-of-the-art symbolic RL methods, with a reduced amount of symbolic prior knowledge.
translated by 谷歌翻译