Spatial-temporal (ST) graph modeling, such as traffic speed forecasting and taxi demand prediction, is an important task in deep learning. However, the ST patterns of different nodes in a graph can vary greatly in modeling difficulty, owing to the heterogeneous nature of ST data. We argue that presenting nodes to the model in a meaningful order, from easy to complex, can improve performance over the traditional training procedure. This idea has its roots in Curriculum Learning, which suggests that models can be sensitive to noise and difficult samples in the early stages of training. In this paper, we propose ST-Curriculum Dropout, a novel and easy-to-implement strategy for spatial-temporal graph modeling. Specifically, we evaluate the learning difficulty of each node in a high-level feature space and drop the difficult ones out, so that the model only needs to handle fundamental ST relations at the beginning before gradually moving on to harder ones. Our strategy can be applied to any canonical deep learning architecture without extra trainable parameters, and extensive experiments on a wide range of datasets illustrate that, by controlling the difficulty level of ST relations as training progresses, the model captures a better representation of the data and thus generalizes better.
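The abstract gives no code, so here is a minimal sketch of the core mechanic under stated assumptions: node difficulty is proxied by per-node training loss, and the fraction of nodes kept grows linearly over training. The names (`curriculum_node_mask`, `min_keep`) are hypothetical, not from the paper.

```python
import torch

def curriculum_node_mask(per_node_loss: torch.Tensor, epoch: int,
                         total_epochs: int, min_keep: float = 0.5) -> torch.Tensor:
    """Keep the easiest nodes first; gradually admit harder ones.

    per_node_loss: (num_nodes,) difficulty proxy, e.g. each node's recent loss.
    Returns a boolean mask over nodes to include in the loss this epoch.
    """
    # Fraction of nodes kept grows linearly from min_keep to 1.0.
    keep = min(1.0, min_keep + (1.0 - min_keep) * epoch / max(1, total_epochs - 1))
    k = max(1, int(keep * per_node_loss.numel()))
    easiest = torch.argsort(per_node_loss)[:k]   # smallest loss = easiest
    mask = torch.zeros_like(per_node_loss, dtype=torch.bool)
    mask[easiest] = True
    return mask

# Usage in a training loop (assumed shapes: pred/target are (batch, time, nodes)):
# per_node = loss_fn(pred, target).mean(dim=(0, 1))      # one score per node
# mask = curriculum_node_mask(per_node.detach(), epoch, total_epochs)
# loss = per_node[mask].mean()
```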
Estimating the travel time of a path is an important topic in intelligent transportation systems and underlies real-world applications such as traffic monitoring, route planning, and taxi dispatching. However, building a model for such a data-driven task requires a large amount of users' travel information, which is directly tied to their privacy and is therefore unlikely to be shared. The non-independent and identically distributed (non-IID) nature of trajectory data across data owners also makes it extremely challenging to train a single prediction model if we directly apply federated learning. Finally, previous work on travel time estimation does not consider the real-time traffic state of roads, which we argue can greatly affect prediction. To address the above challenges, we introduce GOF-TTE, a generative online federated learning framework for travel time estimation for mobile user groups, which i) uses federated learning to keep private data on client devices during training, and designs the global model shared by all clients as an online generative model that infers the real-time road traffic state; and ii) besides sharing the base model on the server, fine-tunes a personalized model for each client to capture their individual driving habits, compensating for the residual error of the localized global model's predictions. We also apply a simple privacy attack to our framework and implement a differential privacy mechanism to further guarantee privacy safety. Finally, we conduct experiments on two real-world public taxi datasets from DiDi Chengdu and Xi'an. The experimental results demonstrate the effectiveness of our proposed framework.
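The following is a minimal sketch of the personalization idea this abstract describes, under assumptions: each client keeps a local fine-tuned head that predicts the residual of the shared global model, and only the global weights are averaged on the server. All class and function names are hypothetical.

```python
import torch
import torch.nn as nn

class PersonalizedTTE(nn.Module):
    """Global model shared by all clients plus a per-client residual head.

    The global model infers shared context (e.g. real-time traffic state);
    the local head corrects its prediction for one driver's habits.
    """
    def __init__(self, global_model: nn.Module, feat_dim: int):
        super().__init__()
        self.global_model = global_model            # aggregated on the server
        self.residual = nn.Sequential(              # stays on the client device
            nn.Linear(feat_dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        base = self.global_model(x)                 # shared prediction
        return base + self.residual(x).squeeze(-1)  # client-specific correction

def fedavg(global_state_dicts):
    """One federated round (sketch): average only the shared global weights;
    the personalized residual heads are never uploaded."""
    keys = global_state_dicts[0].keys()
    return {k: torch.stack([sd[k].float() for sd in global_state_dicts]).mean(0)
            for k in keys}
```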
Owing to the rapid development of Internet of Things (IoT) technology, many online web applications (e.g., Google Maps and Uber) estimate travel times from trajectory data collected by mobile devices. In practice, however, complex factors such as network communication and energy constraints cause many trajectories to be collected at low sampling rates. In this scenario, this paper aims to solve the travel time estimation (TTE) and route recovery problems in sparse settings, where both the travel times and the routes between consecutively sampled GPS points are uncertain. We formulate this as a problem of inexact supervision, in which the training data carry only coarse labels, and jointly solve the tasks of TTE and route recovery. We argue that the two tasks complement each other during model learning and hold this relation: more precise travel times lead to better route inference, which in turn leads to more accurate time estimation. Based on this assumption, we propose an EM algorithm for sparse trajectories that alternately estimates the travel time of the inferred route via weak supervision in the E-step and recovers the route according to the estimated travel time in the M-step. We conduct experiments on three real-world trajectory datasets and demonstrate the effectiveness of the proposed method.
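A minimal sketch of the alternating scheme this abstract describes (not the authors' code): the E-step fits travel times using the current routes as weak labels, and the M-step re-picks the route whose predicted time best matches the observation. The interfaces (`tte_model.fit`, `.total_time`) are hypothetical placeholders.

```python
def em_tte_route_recovery(sparse_pairs, candidate_routes, tte_model, n_iters=10):
    """sparse_pairs: consecutive GPS point pairs, each with an observed
    total travel time (pair.total_time).
    candidate_routes: for each pair, a list of plausible road-segment paths.
    tte_model: predicts a route's travel time; refit each iteration.
    """
    # Initialize: pick the shortest candidate route for each pair.
    routes = [min(cands, key=len) for cands in candidate_routes]
    for _ in range(n_iters):
        # E-step: fit the TTE model with current routes as (weak) labels,
        # distributing each observed total time over the route's segments.
        tte_model.fit(routes, [pair.total_time for pair in sparse_pairs])
        # M-step: choose the candidate route whose predicted time best
        # matches the observed total time for each pair.
        routes = [min(cands,
                      key=lambda r: abs(tte_model.predict(r) - pair.total_time))
                  for pair, cands in zip(sparse_pairs, candidate_routes)]
    return tte_model, routes
```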
Annotating large-scale datasets for supervised video shadow detection methods is challenging. Directly using a model trained on labeled images leads to high generalization error and temporally inconsistent results. In this paper, we address these challenges by proposing a Spatio-Temporal Interpolation Consistency Training (STICT) framework that rationally feeds unlabeled video frames together with labeled images into an image shadow detection network for training. Specifically, we propose spatial and temporal ICT, in which two new interpolation schemes are defined, i.e., spatial interpolation and temporal interpolation. We then derive the corresponding spatial and temporal interpolation consistency constraints to enhance generalization in the pixel-wise classification task and to encourage temporally consistent predictions, respectively. In addition, we design a scale-aware network for multi-scale shadow knowledge learning in images, and propose a scale-consistency constraint to minimize the discrepancy between predictions at different scales. Our proposed approach is extensively validated on the ViSha dataset and a self-annotated dataset. Experimental results show that, even without video labels, our approach performs better than most state-of-the-art supervised, semi-supervised, and unsupervised image/video shadow detection methods, as well as other methods in related tasks. Code and dataset are available at https://github.com/yihong-97/stict.
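A minimal sketch of the interpolation consistency idea, following the general ICT recipe rather than the authors' implementation: the prediction on a mixed input should match the mix of the predictions. The `model` interface and the pixel-space mixup are assumptions.

```python
import torch
import torch.nn.functional as F

def interpolation_consistency_loss(model, frame_a, frame_b, alpha=0.2):
    """Encourage f(mix(x_a, x_b)) to match mix(f(x_a), f(x_b)) on unlabeled data.

    frame_a, frame_b: unlabeled frames, shape (B, C, H, W). For temporal ICT
    take them from the same video; for spatial ICT, from different images.
    """
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    with torch.no_grad():                       # targets from a frozen pass
        pred_a = torch.sigmoid(model(frame_a))
        pred_b = torch.sigmoid(model(frame_b))
    mixed_input = lam * frame_a + (1 - lam) * frame_b
    mixed_target = lam * pred_a + (1 - lam) * pred_b
    pred_mixed = torch.sigmoid(model(mixed_input))
    return F.mse_loss(pred_mixed, mixed_target)
```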
In this paper, we propose a robust 3D detector, named Cross Modal Transformer (CMT), for end-to-end 3D multi-modal detection. Without explicit view transformation, CMT takes image and point cloud tokens as inputs and directly outputs accurate 3D bounding boxes. The spatial alignment of multi-modal tokens is performed implicitly, by encoding the 3D points into multi-modal features. The core design of CMT is quite simple, yet its performance is impressive: CMT obtains 73.0% NDS on the nuScenes benchmark. Moreover, CMT remains strongly robust even when the LiDAR is missing. Code will be released at https://github.com/junjie18/CMT.
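The key mechanism the abstract points at is that a shared map from 3D coordinates to embeddings aligns tokens from both modalities, so a transformer decoder can attend across them without an explicit view transform. A minimal sketch under assumptions (token shapes, a simple MLP position encoder; not CMT's actual layers):

```python
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    """Fuse image and point-cloud tokens via a shared 3D position encoding.

    Assumed shapes: img_tokens (B, N_i, D), pts_tokens (B, N_p, D), with
    per-token 3D coordinates of shape (B, N, 3) from camera rays / voxels.
    """
    def __init__(self, d_model=256, n_heads=8, n_queries=100):
        super().__init__()
        self.pos_mlp = nn.Sequential(nn.Linear(3, d_model), nn.ReLU(),
                                     nn.Linear(d_model, d_model))
        self.queries = nn.Embedding(n_queries, d_model)   # object queries
        layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=3)

    def forward(self, img_tokens, img_coords, pts_tokens, pts_coords):
        # The same 3D -> embedding map aligns both modalities implicitly.
        memory = torch.cat([img_tokens + self.pos_mlp(img_coords),
                            pts_tokens + self.pos_mlp(pts_coords)], dim=1)
        q = self.queries.weight.unsqueeze(0).expand(memory.size(0), -1, -1)
        return self.decoder(q, memory)   # (B, n_queries, D) -> box/cls heads
```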
Different people speak with diverse personalized speaking styles. Although existing one-shot talking head methods have made significant progress in lip sync, natural facial expressions, and stable head motions, they still cannot generate diverse speaking styles in the final talking head videos. To tackle this problem, we propose a one-shot style-controllable talking face generation framework. In a nutshell, we aim to attain a speaking style from an arbitrary reference speaking video and then drive the one-shot portrait to speak with the reference speaking style and another piece of audio. Specifically, we first develop a style encoder to extract dynamic facial motion patterns of a style reference video and then encode them into a style code. Afterward, we introduce a style-controllable decoder to synthesize stylized facial animations from the speech content and style code. In order to integrate the reference speaking style into generated videos, we design a style-aware adaptive transformer, which enables the encoded style code to adjust the weights of the feed-forward layers accordingly. Thanks to the style-aware adaptation mechanism, the reference speaking style can be better embedded into synthesized videos during decoding. Extensive experiments demonstrate that our method is capable of generating talking head videos with diverse speaking styles from only one portrait image and an audio clip while achieving authentic visual effects. Project Page: https://github.com/FuxiVirtualHuman/styletalk.
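The style-aware adaptive transformer's central trick, as described, is letting the style code adjust the feed-forward weights. A minimal sketch of one way to realize this (a hypothetical FiLM-style modulation, not necessarily the paper's exact formulation):

```python
import torch
import torch.nn as nn

class StyleAdaptiveFFN(nn.Module):
    """Feed-forward layer whose hidden units are gated by a style code.

    Hypothetical realization: the style code predicts per-channel scales
    applied to the hidden activations of the feed-forward block.
    """
    def __init__(self, d_model=256, d_ff=1024, d_style=128):
        super().__init__()
        self.fc1 = nn.Linear(d_model, d_ff)
        self.fc2 = nn.Linear(d_ff, d_model)
        self.style_to_scale = nn.Linear(d_style, d_ff)

    def forward(self, x: torch.Tensor, style: torch.Tensor) -> torch.Tensor:
        # x: (B, T, d_model) content features; style: (B, d_style) style code.
        scale = torch.sigmoid(self.style_to_scale(style)).unsqueeze(1)  # (B,1,d_ff)
        h = torch.relu(self.fc1(x)) * scale   # style gates the hidden units
        return self.fc2(h)
```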
The visual dimension of cities has been a fundamental subject in urban studies, since the pioneering work of scholars such as Sitte, Lynch, Arnheim, and Jacobs. Several decades later, big data and artificial intelligence (AI) are revolutionizing how people move, sense, and interact with cities. This paper reviews the literature on the appearance and function of cities to illustrate how visual information has been used to understand them. A conceptual framework, Urban Visual Intelligence, is introduced to systematically elaborate on how new image data sources and AI techniques are reshaping the way researchers perceive and measure cities, enabling the study of the physical environment and its interactions with socioeconomic environments at various scales. The paper argues that these new approaches enable researchers to revisit the classic urban theories and themes, and potentially help cities create environments that are more in line with human behaviors and aspirations in the digital age.
Deploying reliable deep learning techniques in interdisciplinary applications requires learned models to output accurate and, even more importantly, explainable predictions. Existing approaches typically explicate network outputs in a post-hoc fashion, under the implicit assumption that faithful explanations follow from accurate predictions/classifications. We make the opposite claim: explanations boost (or even determine) classification. That is, end-to-end learning of explanation factors to augment discriminative representation extraction can be a more intuitive strategy to inversely ensure fine-grained explainability, e.g., in neuroimaging and neuroscience studies with high-dimensional data containing noisy, redundant, and task-irrelevant information. In this paper, we propose such an explainable geometric deep network, dubbed NeuroExplainer, and apply it to uncover altered infant cortical development patterns associated with preterm birth. Given fundamental cortical attributes as network input, NeuroExplainer adopts a hierarchical attention-decoding framework to learn fine-grained attentions and the respective discriminative representations, so as to accurately distinguish preterm infants from term-born infants at term-equivalent age. NeuroExplainer learns the hierarchical attention-decoding modules under subject-level weak supervision, coupled with targeted regularizers deduced from domain knowledge about brain development. These prior-guided constraints implicitly maximize the explainability metrics (i.e., fidelity, sparsity, and stability) during network training, driving the learned network to output detailed explanations along with accurate classifications. Experimental results on the public dHCP benchmark suggest that NeuroExplainer produces quantitatively reliable explanations that are qualitatively consistent with representative neuroimaging studies.
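As a concrete illustration of regularizing attention for explainability, the losses below are common proxies for two of the named metrics, not necessarily the paper's exact definitions: sparsity as an entropy penalty on attention maps, and stability as consistency under small input perturbations. The `model.attention` interface is an assumption.

```python
import torch

def sparsity_loss(attn: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Entropy of normalized attention: low entropy = sparse, focused maps.
    attn: (B, N) nonnegative attention over N cortical vertices."""
    p = attn / (attn.sum(dim=1, keepdim=True) + eps)
    return -(p * (p + eps).log()).sum(dim=1).mean()

def stability_loss(model, x: torch.Tensor, sigma: float = 0.01) -> torch.Tensor:
    """Attention maps should be stable under small input perturbations."""
    attn_clean = model.attention(x)
    attn_noisy = model.attention(x + sigma * torch.randn_like(x))
    return (attn_clean - attn_noisy).abs().mean()
```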
Domain adaptive detection aims to improve the generalization of detectors on a target domain. To reduce the discrepancy in feature distributions between two domains, recent approaches achieve domain adaptation through feature alignment at different granularities via adversarial learning. However, they neglect the relationship between multiple granularities and different features during alignment, which degrades detection. To address this, we introduce a unified multi-granularity alignment (MGA)-based detection framework for domain-invariant feature learning. The key is to encode the dependencies across different granularities, including pixel-, instance-, and category-levels, simultaneously to align the two domains. Specifically, based on pixel-level features, we first develop an omni-scale gated fusion (OSGF) module to aggregate discriminative representations of instances with scale-aware convolutions, leading to robust multi-scale detection. In addition, we introduce multi-granularity discriminators to identify which domain, source or target, samples of different granularities come from. Note that MGA not only leverages instance discriminability across different categories but also exploits category consistency between the two domains for detection. Furthermore, we present an adaptive exponential moving average (AEMA) strategy that uses model assessments during model updates to improve pseudo labels and alleviate the local misalignment problem, boosting detection robustness. Extensive experiments on multiple domain adaptation scenarios validate the superiority of MGA over other approaches on FCOS and Faster R-CNN detectors. Code will be released at https://github.com/tiankongzhang/MGA.
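Of the components above, the AEMA update is the easiest to sketch in isolation: instead of a fixed EMA momentum for the teacher that generates pseudo labels, the momentum adapts to an assessment of the current student. This is a sketch under assumptions; the assessment signal (e.g., agreement between student predictions and current pseudo labels) is an assumed proxy, not the paper's definition.

```python
import torch

@torch.no_grad()
def aema_update(teacher, student, assessment: float,
                m_low: float = 0.99, m_high: float = 0.999):
    """Adaptive EMA: trust the student more (lower momentum) when its
    assessment score is high; keep the teacher stable otherwise.

    assessment: a score in [0, 1] rating the current student.
    """
    momentum = m_high - (m_high - m_low) * assessment
    for t_param, s_param in zip(teacher.parameters(), student.parameters()):
        t_param.mul_(momentum).add_(s_param, alpha=1.0 - momentum)
```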
In the black-box adversarial attack scenario, the target model's parameters are unknown, and the attacker aims to find a successful adversarial perturbation based on query feedback under a query budget. Due to the limited feedback information, existing query-based black-box attack methods often require many queries to attack each benign example. To reduce query cost, we propose to utilize the feedback information across historical attacks, dubbed example-level adversarial transferability. Specifically, by treating the attack on each benign example as one task, we develop a meta-learning framework, training a meta-generator to produce perturbations conditioned on benign examples. When attacking a new benign example, the meta-generator can be quickly fine-tuned on the feedback information of the new task as well as a few historical attacks to produce effective perturbations. Moreover, since the meta-training procedure consumes many queries to learn a generalizable generator, we utilize model-level adversarial transferability: we train the meta-generator on a white-box surrogate model, then transfer it to help attack the target model. The proposed framework, with its two types of adversarial transferability, can be naturally combined with any off-the-shelf query-based attack method to boost performance, as verified by extensive experiments.
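A minimal sketch of the fine-tune-then-attack loop the abstract describes; the generator architecture, the loss, and the query interface are all assumptions. For simplicity, fine-tuning gradients here come from a differentiable white-box surrogate, while the black-box target is only queried to check success:

```python
import copy
import torch

def attack_with_meta_generator(meta_gen, surrogate_loss, query_model, x,
                               n_steps=5, lr=1e-3, eps=8 / 255):
    """Fine-tune a meta-trained generator for one new benign example x.

    surrogate_loss: differentiable attack loss on a white-box surrogate model
    (in the true black-box step, gradient estimates from query feedback would
    be used instead -- simplified here).
    query_model: black-box oracle; returns True if x_adv fools the target.
    """
    gen = copy.deepcopy(meta_gen)             # start from meta-learned weights
    opt = torch.optim.Adam(gen.parameters(), lr=lr)
    for _ in range(n_steps):
        delta = eps * torch.tanh(gen(x))      # L_inf-bounded perturbation
        loss = surrogate_loss(x + delta)      # e.g. margin of the true class
        opt.zero_grad(); loss.backward(); opt.step()
        with torch.no_grad():
            if query_model(x + eps * torch.tanh(gen(x))):  # one query per step
                break                          # success: stop spending queries
    return (x + eps * torch.tanh(gen(x))).detach()
```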