Current methods for few-shot action recognition mainly fall into the metric-learning framework following ProtoNet. However, they either ignore the effect of representative prototypes or fail to adequately enhance the prototypes with multimodal information. In this work, we propose a novel Multimodal Prototype-Enhanced Network (MORN) that uses the semantic information of label texts as multimodal information to enhance prototypes; it consists of two modality flows. A CLIP visual encoder is introduced in the visual flow, and visual prototypes are computed by the Temporal-Relational CrossTransformer (TRX) module. A frozen CLIP text encoder is introduced in the text flow, and a semantic-enhanced module is used to enhance text features. After inflating the text features, text prototypes are obtained. The final multimodal prototypes are then computed by a multimodal prototype-enhanced module. Moreover, no evaluation metric exists to assess the quality of prototypes. To the best of our knowledge, we are the first to propose a prototype evaluation metric, called Prototype Similarity Difference (PRIDE), which evaluates how well prototypes discriminate between categories. We conduct extensive experiments on four popular datasets, and MORN achieves state-of-the-art results on HMDB51, UCF101, Kinetics and SSv2. MORN also performs well on PRIDE, and we explore the correlation between PRIDE and accuracy.
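The prototype-enhancement idea above can be sketched as follows: visual prototypes are class means of support features (ProtoNet-style), and each is fused with the text feature of its class label. The convex-combination fusion and the weight `alpha` are illustrative assumptions, not MORN's actual enhancement module.

```python
import numpy as np

def multimodal_prototypes(support, labels, text_feats, alpha=0.5):
    """Fuse per-class visual means with class-label text features.

    support: (N, D) support-set features; labels: length-N class ids;
    text_feats: dict mapping class id -> (D,) text feature.
    The fusion rule (1-alpha)*visual + alpha*text is a placeholder.
    """
    classes = sorted(set(labels))
    protos = []
    for c in classes:
        idx = [i for i, l in enumerate(labels) if l == c]
        vis = support[idx].mean(axis=0)          # ProtoNet-style class mean
        protos.append((1 - alpha) * vis + alpha * text_feats[c])
    return np.stack(protos)

def classify(query, protos):
    # nearest multimodal prototype under Euclidean distance
    return int(np.argmin(np.linalg.norm(protos - query, axis=1)))
```

A query is then assigned to the class whose fused prototype is nearest, exactly as in metric-learning classifiers.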
translated by Google Translate
Although Transformers and their variant, the Conformer, have shown promising performance in speech recognition, their heavily parameterized nature leads to high memory costs during training and inference. Some works use cross-layer weight sharing to reduce the number of model parameters, but the inevitable loss of capacity harms model performance. To address this issue, this paper proposes a parameter-efficient Conformer via sharing sparsely-gated experts. Specifically, we use sparsely-gated mixture-of-experts (MoE) to extend the capacity of a Conformer block without increasing computation. The parameters of grouped Conformer blocks are then shared to reduce the number of parameters. Next, to ensure that the shared blocks remain flexible enough to adapt representations at different levels, we design the MoE routers and normalization layers individually. In addition, we use knowledge distillation to further improve performance. Experimental results show that, compared with the full-parameter model, the proposed model achieves competitive performance with only 1/3 of the encoder parameters.
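A minimal sketch of the sparsely-gated mixture-of-experts mechanism the abstract builds on, with a top-1 softmax router: the router scores the experts, and only the highest-scoring expert runs, so capacity grows without extra computation. The random expert weights are placeholders; the paper's shared grouped Conformer blocks and individually designed routers are not reproduced here.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

class MoELayer:
    """Sparsely-gated MoE with a top-1 router (illustrative weights)."""

    def __init__(self, dim, num_experts, seed=0):
        rng = np.random.default_rng(seed)
        self.experts = [rng.standard_normal((dim, dim))
                        for _ in range(num_experts)]
        self.router = rng.standard_normal((dim, num_experts))

    def __call__(self, x):
        gates = softmax(x @ self.router)         # routing probabilities
        k = int(np.argmax(gates))                # pick one expert: sparse,
        return gates[k] * (x @ self.experts[k])  # so only one expert runs
```

Dense MoE would run every expert and mix all outputs; the top-1 gate is what keeps the compute cost of the widened block unchanged.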
In this paper we present Mask DINO, a unified object detection and segmentation framework. Mask DINO extends DINO (DETR with Improved Denoising Anchor Boxes) by adding a mask prediction branch which supports all image segmentation tasks (instance, panoptic, and semantic). It makes use of the query embeddings from DINO to dot-product a high-resolution pixel embedding map to predict a set of binary masks. Some key components in DINO are extended for segmentation through a shared architecture and training process. Mask DINO is simple, efficient, and scalable, and it can benefit from joint large-scale detection and segmentation datasets. Our experiments show that Mask DINO significantly outperforms all existing specialized segmentation methods, both on a ResNet-50 backbone and a pre-trained model with SwinL backbone. Notably, Mask DINO establishes the best results to date on instance segmentation (54.5 AP on COCO), panoptic segmentation (59.4 PQ on COCO), and semantic segmentation (60.8 mIoU on ADE20K) among models under one billion parameters. Code is available at \url{https://github.com/IDEACVR/MaskDINO}.
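The mask branch described above reduces to a dot product followed by a threshold: each query embedding is multiplied against a high-resolution per-pixel embedding map to give per-pixel logits, which become one binary mask per query. The shapes and the zero threshold below are illustrative, not Mask DINO's exact configuration.

```python
import numpy as np

def predict_masks(queries, pixel_embed):
    """Dot-product mask prediction (sketch).

    queries: (Q, C) query embeddings from the decoder.
    pixel_embed: (C, H, W) high-resolution pixel embedding map.
    Returns (Q, H, W) boolean masks via a zero threshold (assumed).
    """
    C, H, W = pixel_embed.shape
    logits = queries @ pixel_embed.reshape(C, H * W)   # (Q, H*W) per-pixel scores
    return (logits > 0).reshape(-1, H, W)              # one binary mask per query
```

In training, these logits would be supervised with mask losses; here only the forward computation is shown.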
We present DINO (\textbf{D}ETR with \textbf{I}mproved de\textbf{N}oising anch\textbf{O}r boxes), a state-of-the-art end-to-end object detector. DINO improves over previous DETR-like models in performance and efficiency by using a contrastive way for denoising training, a mixed query selection method for anchor initialization, and a look-forward-twice scheme for box prediction. DINO achieves $49.4$AP in $12$ epochs and $51.3$AP in $24$ epochs on COCO with a ResNet-50 backbone and multi-scale features, yielding a significant improvement of $\textbf{+6.0}$\textbf{AP} and $\textbf{+2.7}$\textbf{AP}, respectively, compared to DN-DETR, the previous best DETR-like model. DINO scales well in both model size and data size. Without bells and whistles, after pre-training on the Objects365 dataset with a SwinL backbone, DINO obtains the best results on both COCO \texttt{val2017} ($\textbf{63.2}$\textbf{AP}) and \texttt{test-dev} ($\textbf{63.3}$AP). Compared with other models on the leaderboard, DINO significantly reduces its model size and pre-training data size while achieving better results. Our code will be available at \url{https://github.com/IDEACVR/DINO}.
We present in this paper a novel denoising training method to speed up DETR (DEtection TRansformer) training and offer a deepened understanding of the slow convergence issue of DETR-like methods. We show that the slow convergence results from the instability of bipartite graph matching, which causes inconsistent optimization goals in early training stages. To address this issue, in addition to the Hungarian loss, our method feeds ground-truth bounding boxes with noises into the Transformer decoder and trains the model to reconstruct the original boxes, which effectively reduces the bipartite graph matching difficulty and leads to faster convergence. Our method is universal and can be easily plugged into any DETR-like method by adding dozens of lines of code. As a result, our DN-DETR yields a remarkable improvement ($+1.9$AP) under the same setting and achieves the best result (AP $43.4$ and $48.6$ with $12$ and $50$ epochs of training respectively) among DETR-like methods with a ResNet-$50$ backbone. Compared with the baseline under the same setting, DN-DETR achieves comparable performance with $50\%$ of the training epochs. Code is available at \url{https://github.com/FengLi-ust/DN-DETR}.
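The denoising input described above can be sketched as follows: ground-truth boxes in $(c_x, c_y, w, h)$ form receive small center shifts and scale jitter before being fed to the decoder as extra queries, which the model is trained to denoise back to the originals. The uniform noise and the scale $\lambda = 0.1$ are assumptions for illustration, not the paper's exact noise schedule.

```python
import numpy as np

def noise_boxes(boxes, lam=0.1, rng=None):
    """Perturb ground-truth boxes (cx, cy, w, h) for denoising training.

    Centers are shifted by up to lam * (w, h); width/height are scaled
    by a factor in [1 - lam, 1 + lam]. Both choices are illustrative.
    """
    rng = rng or np.random.default_rng(0)
    boxes = np.asarray(boxes, dtype=float)
    shift = rng.uniform(-lam, lam, boxes[:, :2].shape) * boxes[:, 2:]
    scale = rng.uniform(1 - lam, 1 + lam, boxes[:, 2:].shape)
    noised = boxes.copy()
    noised[:, :2] += shift   # jitter centers
    noised[:, 2:] *= scale   # jitter sizes
    return noised
```

The reconstruction loss between the noised queries' predictions and the original boxes is what bypasses the unstable bipartite matching for these queries.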
In this paper, we study the problem of predicting the future volatility of foreign exchange currency pairs using deep learning techniques. We show step by step how to construct a deep learning network guided by empirical patterns of intraday volatility. Numerical results show that a multiscale Long Short-Term Memory (LSTM) model with multiple currency pairs as input consistently achieves state-of-the-art accuracy compared with both conventional baselines (i.e., autoregressive and GARCH models) and other deep learning models.
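For reference, one of the baselines cited above, a GARCH(1,1) model, reduces to a one-line conditional-variance recursion over squared returns. The parameter values below are illustrative, not fitted.

```python
import numpy as np

def garch_variance(returns, omega=1e-6, alpha=0.1, beta=0.85):
    """GARCH(1,1) conditional variance:

        sigma^2_t = omega + alpha * r^2_{t-1} + beta * sigma^2_{t-1}

    Initialized at the sample variance; parameters are illustrative.
    """
    returns = np.asarray(returns, dtype=float)
    var = np.empty(len(returns))
    var[0] = np.var(returns)
    for t in range(1, len(returns)):
        var[t] = omega + alpha * returns[t - 1] ** 2 + beta * var[t - 1]
    return var
```

The LSTM model in the paper replaces this fixed recursion with a learned one over multiple currency pairs and time scales.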
Boolean matching is important for digital integrated circuit design. Exhaustive methods for Boolean matching are expensive even for functions with only a few variables, because the time complexity of such algorithms for an $n$-variable Boolean function is $O(2^{n+1}n!)$. Sensitivity is an important characteristic, and a measure of the complexity, of Boolean functions. It has been used to analyze the complexity of algorithms in different fields. This measure can be regarded as a signature of a Boolean function and has great potential to help reduce the search space of Boolean matching. In this paper, we introduce Boolean sensitivity into Boolean matching and design several related signatures to enhance fast Boolean matching. First, we propose some new signatures that relate sensitivity to Boolean equivalence. Then, we prove that these signatures are prerequisites for Boolean matching, which we can use to reduce the search space of the matching problem. Furthermore, we develop a fast sensitivity computation method to calculate and compare these signatures of two Boolean functions. Compared with the traditional cofactor and symmetry-detection methods, sensitivity provides a series of signatures along another dimension. We also show that sensitivity can be easily integrated into traditional methods to distinguish unmatched Boolean functions faster. To the best of our knowledge, this is the first work that introduces sensitivity to Boolean matching. Experimental results show that the sensitivity-related signatures we propose can reduce the search space to a great extent and achieve up to 3x speedup over state-of-the-art Boolean matching methods.
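Sensitivity as used above can be stated concretely: for a Boolean function $f$ on $n$ variables, the sensitivity at input $x$ is the number of single-bit flips that change $f(x)$, and the sensitivity of $f$ is the maximum over all inputs. The sketch below enumerates all $2^n$ inputs, so it is only practical for the small-$n$ setting the abstract considers; the paper's fast computation method is not reproduced.

```python
from itertools import product

def sensitivity(f, n):
    """Maximum sensitivity of a Boolean function f over {0,1}^n.

    f takes an n-tuple of 0/1 and returns 0/1. Exhaustive: O(n * 2^n).
    """
    best = 0
    for x in product((0, 1), repeat=n):
        # count single-bit flips of x that change the output
        s = sum(f(x) != f(x[:i] + (1 - x[i],) + x[i + 1:])
                for i in range(n))
        best = max(best, s)
    return best
```

Since sensitivity is invariant under input permutation and negation, equal sensitivities are a necessary condition for a match, which is what makes it usable as a filtering signature.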
Synthetic data is an emerging technology that can significantly accelerate the development and deployment of AI machine learning pipelines. In this work, we develop a high-fidelity time-series generator, SigWGAN, by combining continuous-time stochastic models with a newly proposed signature $W_1$ metric. The former is the Logsig-RNN model based on stochastic differential equations, whereas the latter originates from the universal and principled mathematical features of the signature to characterize the measure induced by a time series. SigWGAN turns the computationally challenging GAN min-max problem into supervised learning while generating high-fidelity samples. We validate the proposed model on both synthetic data generated by popular quantitative risk models and empirical financial data. Code is available at https://github.com/SigCGANs/Sig-Wasserstein-GANs.git.
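The signature featurization behind the Sig-$W_1$ metric can be illustrated at truncation level 2: level 1 of the signature is the path's total increment, and level 2 collects the iterated integrals $S^{ij}$, which are computed exactly for a piecewise-linear path by the formula below. This is a minimal numpy sketch of the generic signature transform, not the paper's generator or metric implementation.

```python
import numpy as np

def signature_level2(path):
    """Level-1 and level-2 signature of a piecewise-linear path.

    path: (T, d) array of sample points. Returns (level1, level2) with
    level1[i] = X^i_T - X^i_0 and
    level2[i, j] = sum_t (X^i_t - X^i_0) dX^j_t + 0.5 * sum_t dX^i_t dX^j_t,
    exact for linear interpolation between samples.
    """
    path = np.asarray(path, dtype=float)
    inc = np.diff(path, axis=0)                  # per-step increments
    level1 = path[-1] - path[0]
    start = path[:-1] - path[0]                  # X_t - X_0 at step starts
    level2 = start.T @ inc + 0.5 * inc.T @ inc   # iterated integrals S^{ij}
    return level1, level2
```

Comparing expected truncated signatures of real and generated paths is what lets the Sig-$W_1$ objective replace the adversarial discriminator with a supervised fit.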
This paper contributes to skeleton-based human action recognition. The key step is to develop a generic network architecture to extract discriminative features for spatio-temporal skeleton data. In this paper, we propose a novel module, namely Logsig-RNN, which is the combination of a log-signature layer and a recurrent-type neural network (RNN). The former comes from the mathematically principled signature technique; the log-signature, as a representation of streamed data, can manage high-sample-rate streams, non-uniform sampling and time series of variable length. It serves as an enhancement of the recurrent layer and can be conveniently plugged into neural networks. In addition, we propose two path transformation layers to significantly reduce the path dimension while retaining the essential information fed into the Logsig-RNN module. Finally, numerical results demonstrate that replacing the RNN module with the Logsig-RNN module in SOTA networks consistently improves performance on both Chalearn gesture data and NTU RGB+D 120 action data in terms of accuracy and robustness. In particular, we achieve state-of-the-art accuracy on Chalearn 2013 gesture data by combining simple path transformation layers with the Logsig-RNN. Code is available at https://github.com/steveliao93/GCN_LogsigRNN.
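A sketch of the log-signature representation named above, truncated at level 2: its first level equals the signature's first level (the total increment), and its second level keeps only the antisymmetric part $\tfrac{1}{2}(S^{ij} - S^{ji})$ (the Lie-bracket coefficients), removing the shuffle redundancy of the full signature. The discrete formulas assume a piecewise-linear path; this is not the paper's Logsig-RNN layer itself.

```python
import numpy as np

def logsignature_level2(path):
    """Level-2 truncated log-signature of a piecewise-linear path.

    Level 1: total increment X_T - X_0.
    Level 2: antisymmetric part 0.5 * (S - S^T) of the level-2
    signature matrix S (the symmetric part is determined by level 1,
    so only this part carries new information).
    """
    path = np.asarray(path, dtype=float)
    inc = np.diff(path, axis=0)               # per-step increments
    level1 = path[-1] - path[0]
    start = path[:-1] - path[0]               # X_t - X_0 at step starts
    s2 = start.T @ inc + 0.5 * inc.T @ inc    # level-2 signature matrix
    return level1, 0.5 * (s2 - s2.T)          # Lie-bracket coefficients
```

The compression is visible in the dimension count: in $d$ channels the full level-2 signature has $d + d^2$ terms, while the log-signature has only $d + d(d-1)/2$.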
As one of the most important psychic stress reactions, micro-expressions (MEs) are spontaneous and transient facial expressions that can reveal the genuine emotions of human beings. Recognizing MEs automatically (MER) is therefore becoming increasingly crucial in the field of affective computing, and provides essential technical support for lie detection, psychological analysis and other areas. However, the lack of abundant ME data seriously restricts the development of cutting-edge data-driven MER models. Although several spontaneous ME datasets have recently been released to alleviate this problem, the available data remains scarce. To solve the problem of ME data hunger, we construct a dynamic spontaneous ME dataset with the largest ME data scale to date, called DFME (Dynamic Facial Micro-expressions), which includes 7,526 well-labeled ME videos induced from 671 participants and annotated by more than 20 annotators over three years. We then apply four classical spatiotemporal feature learning models on DFME to perform MER experiments and objectively verify the validity of the DFME dataset. In addition, we explore different solutions to the class imbalance and key-frame sequence sampling problems in dynamic MER on DFME, so as to provide a valuable reference for future research. The comprehensive experimental results show that our DFME dataset can facilitate research on automatic MER and provides a new benchmark for MER. DFME will be published via https://mea-lab-421.github.io.