最近利用多模式数据旨在建立面部动作单元(AU)检测模型的研究。但是,由于多模式数据的异质性,多模式表示学习成为主要挑战之一。一方面,很难通过仅通过一个特征提取器从多模式中提取相关特征,另一方面,先前的研究并未完全探索多模式融合策略的潜力。例如,早期融合通常需要在推理期间存在所有方式,而晚期融合和中间融合则增加了特征学习的网络大小。与晚期融合的大量工作相反,早期融合探索渠道信息的作品很少。本文提出了一个新型的多模式网络,称为多模式通道混合(MCM),作为一种预训练的模型,以学习强大的表示形式,以促进多模式融合。我们在自动面部动作单元检测的下游任务上评估学习的表示形式。具体而言,它是一个单个流编码器网络,该网络在早期融合中使用频道混合模块,在下游检测任务中仅需要一种模态。我们还利用蒙版的VIT编码器从融合图像中学习特征,并使用两个VIT解码器重建两个模式。我们已经在两个公共数据集(称为BP4D和DISFA)上进行了广泛的实验,以评估所提出的多模式框架的有效性和鲁棒性。结果表明我们的方法是可比或优越的,它与最新的基线方法相当。
translated by 谷歌翻译
已经对蜘蛛/莎拉/风暴等方差降低技术进行了广泛的研究,以提高随机非凸优化的收敛速率,这些优化通常维护和更新跨迭代中单个函数的估计器序列。 {\如果我们需要在迭代中跟踪多个功能映射,但是只有访问$ \ Mathcal {o}的随机样品(1)$在每次迭代时$ functional映射?}在解决一个新兴的家族时,有一个重要的应用程序以$ \ sum_ {i = 1}^m f_i(g_i(\ mathbf {w}))的形式形式的耦合组合优化问题,其中$ g_i $可通过随机甲骨文访问$ g_i $。关键问题是跟踪和估计$ \ mathbf g(\ mathbf {w})=(g_1(\ mathbf {w}),\ ldots,g_m(\ mathbf {w})$ $ \ mathbf g(\ mathbf {w})$具有$ m $块,只允许探测$ \ mathcal {o}(1)$块才能达到其随机值和雅各布人。为了提高解决这些问题的复杂性,我们提出了一种新型随机方法,称为多块单个探针差异(MSVR)估计器,以跟踪$ \ mathbf g(\ mathbf {w})$的序列。它的灵感来自风暴,但引入了定制的误差校正术语,不仅可以减轻所选块的随机样品中的噪声,而且还可以减轻那些未进行采样的块中的噪声。在MSVR估计器的帮助下,我们开发了几种算法来解决上述组成问题,并在具有非convex/convex/convex/strank strank convex目标的各种设置中具有改善的复杂性。我们的结果在几个方面都改善了先前的结果,包括样本复杂性和对强凸参数的依赖。多任务深度AUC最大化的经验研究表明,使用新估计器的性能更好。
translated by 谷歌翻译
在这项工作中,我们将解决方案介绍给Epic-Kitchens-100 2022动作检测挑战。提出了一阶段动作检测变压器(OADT)来对视频段的时间连接进行建模。借助OADT,可以同时识别类别和时间边界。在完成了从不同功能训练的多个OADT模型之后,我们的模型可以达到21.28 \%的动作图,并在操作检测挑战的测试集中排名第一。
translated by 谷歌翻译
在各种计算机视觉任务(例如对象检测,实例分段等)中,无监督的域适应至关重要。他们试图减少域偏差诱导的性能下降,同时还促进模型应用速度。域适应对象检测中的先前作品尝试使图像级和实例级别变化对准以最大程度地减少域差异,但是它们可能会使单级功能与图像级域适应中的混合级功能相结合,因为对象中的每个图像中的每个图像检测任务可能不止一个类和对象。为了通过单级对齐获得单级和混合级对齐方式,我们将功能的混合级视为新班级,并建议使用混合级$ h-divergence $,以供对象检测到实现均匀特征对准并减少负转移。然后,还提出了基于混合级$ h-Divergence $的语义一致性特征对齐模型(SCFAM)。为了改善单层和混合级的语义信息并完成语义分离,SCFAM模型提出了语义预测模型(SPM)和语义桥接组件(SBC)。然后根据SPM结果更改PIX域鉴别器损耗的重量,以减少样品不平衡。广泛使用的数据集上的广泛无监督域的适应实验说明了我们所提出的方法在域偏置设置中的强大对象检测。
translated by 谷歌翻译
Molecular conformation generation aims to generate three-dimensional coordinates of all the atoms in a molecule and is an important task in bioinformatics and pharmacology. Previous methods usually first predict the interatomic distances, the gradients of interatomic distances or the local structures (e.g., torsion angles) of a molecule, and then reconstruct its 3D conformation. How to directly generate the conformation without the above intermediate values is not fully explored. In this work, we propose a method that directly predicts the coordinates of atoms: (1) the loss function is invariant to roto-translation of coordinates and permutation of symmetric atoms; (2) the newly proposed model adaptively aggregates the bond and atom information and iteratively refines the coordinates of the generated conformation. Our method achieves the best results on GEOM-QM9 and GEOM-Drugs datasets. Further analysis shows that our generated conformations have closer properties (e.g., HOMO-LUMO gap) with the groundtruth conformations. In addition, our method improves molecular docking by providing better initial conformations. All the results demonstrate the effectiveness of our method and the great potential of the direct approach. The code is released at https://github.com/DirectMolecularConfGen/DMCG
translated by 谷歌翻译
在本文中,我们研究了从同步2D和3D数据共同估计光流量和场景流的问题。以前的方法使用复杂的管道,将联合任务分成独立阶段,或以“早期融合”或“迟到的”方式“的熔断器2D和3D信息。这种单尺寸适合的方法遭受了未能充分利用每个模态的特征的困境,或者最大化模态互补性。为了解决这个问题,我们提出了一个新的端到端框架,称为Camliflow。它由2D和3D分支组成,在特定层之间具有多个双向连接。与以前的工作不同,我们应用基于点的3D分支以更好地提取几何特征,并设计一个对称的学习操作员以保险熔断致密图像特征和稀疏点特征。我们还提出了一种转换,以解决3D-2D投影的非线性问题。实验表明,Camliflow以更少的参数实现了更好的性能。我们的方法在Kitti场景流基准上排名第一,表现出以1/7参数的前一篇文章。代码将可用。
translated by 谷歌翻译
Existing deep learning-based traffic forecasting models are mainly trained with MSE (or MAE) as the loss function, assuming that residuals/errors follow independent and isotropic Gaussian (or Laplacian) distribution for simplicity. However, this assumption rarely holds for real-world traffic forecasting tasks, where the unexplained residuals are often correlated in both space and time. In this study, we propose Spatiotemporal Residual Regularization by modeling residuals with a dynamic (e.g., time-varying) mixture of zero-mean multivariate Gaussian distribution with learnable spatiotemporal covariance matrices. This approach allows us to directly capture spatiotemporally correlated residuals. For scalability, we model the spatiotemporal covariance for each mixture component using a Kronecker product structure, which significantly reduces the number of parameters and computation complexity. We evaluate the performance of the proposed method on a traffic speed forecasting task. Our results show that, by properly modeling residual distribution, the proposed method not only improves the model performance but also provides interpretable structures.
translated by 谷歌翻译
We introduce the MAsked Generative VIdeo Transformer, MAGVIT, to tackle various video synthesis tasks with a single model. We introduce a 3D tokenizer to quantize a video into spatial-temporal visual tokens and propose an embedding method for masked video token modeling to facilitate multi-task learning. We conduct extensive experiments to demonstrate the quality, efficiency, and flexibility of MAGVIT. Our experiments show that (i) MAGVIT performs favorably against state-of-the-art approaches and establishes the best-published FVD on three video generation benchmarks, including the challenging Kinetics-600. (ii) MAGVIT outperforms existing methods in inference time by two orders of magnitude against diffusion models and by 60x against autoregressive models. (iii) A single MAGVIT model supports ten diverse generation tasks and generalizes across videos from different visual domains. The source code and trained models will be released to the public at https://magvit.cs.cmu.edu.
translated by 谷歌翻译
Spatiotemporal traffic data imputation is of great significance in intelligent transportation systems and data-driven decision-making processes. To make an accurate reconstruction on partially observed traffic data, we assert the importance of characterizing both global and local trends in traffic time series. In the literature, substantial prior works have demonstrated the effectiveness of utilizing low-rankness property of traffic data by matrix/tensor completion models. In this study, we first introduce a Laplacian kernel to temporal regularization for characterizing local trends in traffic time series, which can be formulated in the form of circular convolution. Then, we develop a low-rank Laplacian convolutional representation (LCR) model by putting the nuclear norm of a circulant matrix and the Laplacian temporal regularization together, which is proved to meet a unified framework that takes a fast Fourier transform solution in a relatively low time complexity. Through extensive experiments on some traffic datasets, we demonstrate the superiority of LCR for imputing traffic time series of various time series behaviors (e.g., data noises and strong/weak periodicity). The proposed LCR model is an efficient and effective solution to large-scale traffic data imputation over the existing baseline models. The adapted datasets and Python implementation are publicly available at https://github.com/xinychen/transdim.
translated by 谷歌翻译
Score-based modeling through stochastic differential equations (SDEs) has provided a new perspective on diffusion models, and demonstrated superior performance on continuous data. However, the gradient of the log-likelihood function, i.e., the score function, is not properly defined for discrete spaces. This makes it non-trivial to adapt \textcolor{\cdiff}{the score-based modeling} to categorical data. In this paper, we extend diffusion models to discrete variables by introducing a stochastic jump process where the reverse process denoises via a continuous-time Markov chain. This formulation admits an analytical simulation during backward sampling. To learn the reverse process, we extend score matching to general categorical data and show that an unbiased estimator can be obtained via simple matching of the conditional marginal distributions. We demonstrate the effectiveness of the proposed method on a set of synthetic and real-world music and image benchmarks.
translated by 谷歌翻译