The mainstream of existing approaches to video prediction builds on a Single-In-Single-Out (SISO) architecture, which takes the current frame as input to predict the next frame recursively. This approach often suffers severe performance degradation when extrapolating further into the future, limiting the practical use of such prediction models. Alternatively, a Multi-In-Multi-Out (MIMO) architecture that outputs all future frames in one shot naturally breaks the recursion and therefore prevents error accumulation. However, only a few MIMO models for video prediction have been proposed, and to date they achieve only inferior performance; the real strength of MIMO models in this area has gone largely unnoticed and under-explored. Motivated by this, we conduct a comprehensive investigation in this paper to explore how far a simple MIMO architecture can go. Surprisingly, our empirical studies reveal that a simple MIMO model can outperform state-of-the-art work by a large margin, much more than expected, especially in handling long-term error accumulation. After exploring a number of designs, we propose a new MIMO architecture, MIMO-VP, which extends the pure Transformer with local spatio-temporal blocks and a new multi-output decoder, to establish a new standard in video prediction. We evaluate our model on four highly competitive benchmarks (Moving MNIST, Human3.6M, Weather, KITTI). Extensive experiments show that our model wins 1st place on all the benchmarks with remarkable performance gains and surpasses the best SISO model in all aspects, including efficiency and both quantitative and qualitative performance. We believe our model can serve as a new baseline to facilitate future research on video prediction tasks. The code will be released.
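A minimal sketch of the architectural contrast described above, under assumed toy ConvNet predictors (not the MIMO-VP Transformer itself): a SISO model rolls out recursively and feeds on its own possibly erroneous outputs, while a MIMO model emits all future frames at once.

```python
import torch
import torch.nn as nn

class TinyFramePredictor(nn.Module):
    """SISO stand-in: maps one frame to the next frame."""
    def __init__(self, channels=1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(channels, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, channels, 3, padding=1),
        )
    def forward(self, x):
        return self.net(x)

class TinyMIMOPredictor(nn.Module):
    """MIMO stand-in: maps a stack of context frames to all future frames."""
    def __init__(self, t_in=4, t_out=10, channels=1):
        super().__init__()
        self.t_out, self.c = t_out, channels
        self.net = nn.Sequential(
            nn.Conv2d(t_in * channels, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, t_out * channels, 3, padding=1),
        )
    def forward(self, frames):              # frames: (B, t_in*C, H, W)
        out = self.net(frames)              # (B, t_out*C, H, W)
        b, _, h, w = out.shape
        return out.view(b, self.t_out, self.c, h, w)

siso, mimo = TinyFramePredictor(), TinyMIMOPredictor()
context = torch.randn(2, 4, 1, 16, 16)       # (B, T, C, H, W)

# SISO: recursive rollout -- each step consumes the previous prediction,
# so errors compound over long horizons.
frame, rollout = context[:, -1], []
for _ in range(10):
    frame = siso(frame)
    rollout.append(frame)

# MIMO: all 10 future frames in one shot -- no feedback loop to accumulate error.
future = mimo(context.flatten(1, 2))
print(torch.stack(rollout, 1).shape, future.shape)  # both (2, 10, 1, 16, 16)
```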
Live sports broadcasting normally requires many skills and expertise with domain knowledge for multi-camera production. As the number of cameras keeps increasing, directing a live sports broadcast has become more complicated and challenging than ever before. During production, the broadcast director needs to be highly concentrated, responsive, and knowledgeable. To relieve directors from this intensive effort, we develop an innovative automated sports broadcast directing system, called Smart Director, which aims to mimic the typical human-in-the-loop broadcasting process and automatically create near-professional broadcast programs in real time by using a set of advanced multi-view video analysis algorithms. Inspired by the so-called "three-event" construction of sports broadcasts, we build our system with an event-driven pipeline consisting of three consecutive novel components: 1) multi-view event localization, which detects events by modeling multi-view correlations; 2) multi-view highlight detection, which ranks camera views by their visual importance for view selection; and 3) an auto-broadcasting scheduler, which controls the production of the broadcast video. To the best of our knowledge, our system is the first end-to-end automated directing system for multi-camera sports broadcasting that is driven entirely by semantic understanding of sports events. It is also the first system to address the new problem of multi-view joint event detection through cross-view relation modeling. We conduct both objective and subjective evaluations on a real-world multi-camera soccer dataset, which demonstrate that the quality of our automatically generated videos is comparable to that of human-directed ones. Thanks to its faster response, our system is able to capture more fast-passing and short-duration events that are usually missed by human directors.
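A hypothetical skeleton of the three-stage event-driven pipeline described above. The function names and the Event structure are illustrative assumptions; the paper's actual multi-view analysis algorithms are far more involved.

```python
from dataclasses import dataclass

@dataclass
class Event:
    start: float          # seconds into the match
    end: float
    view_scores: dict     # camera id -> visual-importance score

def localize_events(multi_view_streams) -> list[Event]:
    """Stage 1: detect events by modeling correlations across camera views."""
    ...

def rank_views(event: Event) -> list[int]:
    """Stage 2: order camera views for one event by visual importance."""
    return sorted(event.view_scores, key=event.view_scores.get, reverse=True)

def schedule_broadcast(events: list[Event]) -> list[tuple[float, float, int]]:
    """Stage 3: emit (start, end, camera) segments for the output program."""
    return [(e.start, e.end, rank_views(e)[0]) for e in events]
```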
The field of adversarial textual attack has grown significantly over the last few years, where the commonly considered objective is to craft adversarial examples that can successfully fool the target model. However, the imperceptibility of attacks, which is also a fundamental objective, is often left out by previous studies. In this work, we advocate considering both objectives at the same time and propose a novel multi-objective optimization approach (dubbed HydraText) with a provable performance guarantee, to achieve successful attacks with high imperceptibility. We demonstrate the efficacy of HydraText through extensive experiments under both score-based and decision-based settings, involving five modern NLP models on benchmark datasets. Compared with existing state-of-the-art attacks, HydraText simultaneously achieves higher success rates, lower modification rates, and higher semantic similarity to the original texts. A human evaluation study shows that the adversarial examples crafted by HydraText maintain good validity and naturality. Finally, these examples also exhibit good transferability and can bring notable robustness improvements to the target models via adversarial training.
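A toy sketch of the two-objective view described above: among candidate adversarial texts that already fool the model, prefer those with fewer word modifications and higher semantic similarity. The scoring callables and the simple scalarization are hypothetical placeholders, not HydraText's actual procedure.

```python
def modification_rate(orig: list[str], adv: list[str]) -> float:
    """Fraction of word positions that differ from the original text."""
    return sum(a != b for a, b in zip(orig, adv)) / len(orig)

def pick_attack(orig, candidates, fools_model, similarity):
    """Keep only successful candidates, then trade off the two objectives:
    low modification rate first, then high semantic similarity."""
    successes = [c for c in candidates if fools_model(c)]
    return min(successes,
               key=lambda c: (modification_rate(orig, c), -similarity(orig, c)),
               default=None)
```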
Many daily activities and psychophysical experiments involve maintaining multiple items in working memory. When the items take continuous values (e.g., orientation, contrast, length, loudness), they must be stored in a continuous structure of appropriate dimensions. We investigate how this structure is represented in neural circuits by training recurrent networks to report two previously shown stimulus orientations. We find that the activity manifold for the two orientations resembles a Clifford torus. Although a Clifford torus and a standard torus (the surface of a donut) are topologically equivalent, they have important functional differences. A Clifford torus treats the two orientations equally and keeps them in orthogonal subspaces, as demanded by the task, whereas a standard torus does not. We find and characterize the connectivity patterns that support the Clifford torus. Moreover, in addition to attractors that store information via persistent activity, our networks also use a dynamic code in which units change their tuning to prevent new sensory input from overwriting previously stored inputs. We argue that such dynamic codes are generally required whenever multiple inputs enter a memory system through shared connections. Finally, we apply our framework to a human psychophysics experiment in which subjects reported two memorized orientations. By varying the training conditions of the RNNs, we test and support the hypothesis that human behavior is a product of both neural noise and reliance on a more stable and behaviorally relevant memory of the ordinal relationship between the two orientations. This suggests that suitable inductive biases in RNNs are important for revealing how the human brain implements working memory. Together, these results offer an understanding of the neural computations underlying a class of visual decoding tasks, bridging from human behavior down to synaptic connectivity.
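A small numerical illustration of the geometric claim above: embedding two orientations on a Clifford torus stores them in orthogonal subspaces, so a linear readout of one angle is unaffected by the other. This is a sketch of the geometry only, not of the trained RNNs.

```python
import numpy as np

def clifford_torus(theta1, theta2):
    # Orientation 1 lives in dims 0-1, orientation 2 in dims 2-3;
    # the two circles occupy orthogonal subspaces of R^4.
    return np.array([np.cos(theta1), np.sin(theta1),
                     np.cos(theta2), np.sin(theta2)]) / np.sqrt(2)

x = clifford_torus(0.7, 2.1)
# Decode each angle from its own 2-D subspace; the other angle cannot interfere.
print(np.arctan2(x[1], x[0]))  # 0.7
print(np.arctan2(x[3], x[2]))  # 2.1
```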
Blind image quality assessment (BIQA) remains challenging due to the diversity of distortions and the variation of image content, which complicate the distortion patterns across different scales and aggravate the difficulty of the regression problem in BIQA. However, existing BIQA methods often fail to consider multi-scale distortion patterns and image content, and little research has been done on learning strategies that help the regression model achieve better performance. In this paper, we propose a simple yet effective Progressive Multi-Task Image Quality Assessment (PMT-IQA) model, which contains a multi-scale feature extraction module (MS) and a progressive multi-task learning module (PMT), to help the model learn complex distortion patterns and to better optimize the regression problem in line with the easy-to-hard nature of human learning. To verify the effectiveness of the proposed PMT-IQA model, we conduct experiments on four widely used public datasets; the results indicate that PMT-IQA outperforms the comparison approaches and that both the MS and PMT modules improve the model's performance.
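A minimal sketch of the two ingredients named above: multi-scale feature pooling (MS) and a progressive easy-to-hard weighting between an assumed auxiliary coarse-classification task and the harder quality-regression task. Layer sizes and the weighting schedule are assumptions, not PMT-IQA's actual design.

```python
import torch
import torch.nn as nn

class MultiScaleIQA(nn.Module):
    def __init__(self, n_levels=5):
        super().__init__()
        self.backbone = nn.Conv2d(3, 8, 3, padding=1)
        d = 8 * (1 + 4 + 16)                    # pooled features at scales 1, 2, 4
        self.reg_head = nn.Linear(d, 1)         # quality score (hard task)
        self.cls_head = nn.Linear(d, n_levels)  # coarse quality level (easy task)
    def forward(self, x):
        f = torch.relu(self.backbone(x))
        # MS module stand-in: pool the features at three scales and concatenate.
        feats = [nn.functional.adaptive_avg_pool2d(f, s).flatten(1)
                 for s in (1, 2, 4)]
        h = torch.cat(feats, dim=1)
        return self.reg_head(h), self.cls_head(h)

def progressive_weights(epoch, total_epochs):
    """PMT stand-in: shift the loss emphasis from the easy task to the hard one."""
    t = epoch / total_epochs
    return {"cls": 1.0 - t, "reg": t}

model = MultiScaleIQA()
score, level_logits = model(torch.randn(2, 3, 32, 32))
print(score.shape, level_logits.shape, progressive_weights(3, 10))
```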
Due to their ability to offer more comprehensive information than data from a single view, multi-view (multi-source, multi-modal, multi-perspective, etc.) data are being used more frequently in remote sensing tasks. However, as the number of views grows, the issue of data quality becomes more apparent, limiting the potential benefits of multi-view data. Although recent deep neural network (DNN) based models can learn the weights of data adaptively, the lack of research on explicitly quantifying the data quality of each view when fusing them renders these models inexplicable, and they perform unsatisfactorily and inflexibly in downstream remote sensing tasks. To fill this gap, in this paper, evidential deep learning is introduced to the task of aerial-ground dual-view remote sensing scene classification to model the credibility of each view. Specifically, the theory of evidence is used to calculate an uncertainty value that describes the decision-making risk of each view. Based on this uncertainty, a novel decision-level fusion strategy is proposed to ensure that the view with lower risk obtains more weight, making the classification more credible. On two well-known, publicly available datasets of aerial-ground dual-view remote sensing images, the proposed approach achieves state-of-the-art results, demonstrating its effectiveness. The code and datasets of this article are available at the following address: https://github.com/gaopiaoliang/Evidential.
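A minimal sketch of uncertainty-weighted decision fusion in the spirit described above, using standard subjective-logic quantities derived from per-view evidence (Dirichlet parameters alpha = evidence + 1). The fusion rule shown, weighting each view's beliefs by (1 - uncertainty), is a simplified assumption, not necessarily the paper's exact strategy.

```python
import numpy as np

def view_opinion(evidence):
    """evidence: non-negative per-class evidence, shape (K,)."""
    alpha = evidence + 1.0
    s = alpha.sum()
    belief = evidence / s                 # per-class belief mass
    uncertainty = len(evidence) / s       # u = K / S: high when evidence is weak
    return belief, uncertainty

aerial = np.array([9.0, 1.0, 0.0])        # strong evidence -> low decision risk
ground = np.array([0.5, 0.4, 0.1])        # weak evidence  -> high decision risk

(b_a, u_a), (b_g, u_g) = view_opinion(aerial), view_opinion(ground)
fused = (1 - u_a) * b_a + (1 - u_g) * b_g  # the lower-risk view gets more weight
print(u_a, u_g, fused.argmax())
```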
With the development of technology and the sharing economy, Airbnb, a famous short-term rental platform, has become the first choice for many young people. The pricing of Airbnb listings has always been a problem worth studying. While previous studies achieve promising results, deficiencies remain: (1) the feature attributes of rentals are not rich enough; (2) research on rental text information is not deep enough; (3) few studies predict the rental price in combination with the points of interest (POI) around the house. To address these challenges, we propose a multi-source information embedding (MSIE) model to predict the rental price of Airbnb listings. Specifically, we first select statistical features to embed the original rental data. Second, we generate word feature vectors and sentiment scores from three different kinds of text information and combine them to form the text feature embedding. Third, we use the points of interest (POI) around the rental house to generate a variety of spatial network graphs and learn network embeddings to obtain the spatial feature embedding. Finally, we combine the three modules into a multi-source rental representation and use a fully connected neural network to predict the price. The analysis of the experimental results shows the effectiveness of our proposed model.
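A minimal sketch of the final fusion step described above: concatenating the statistical, text, and spatial embeddings into one rental representation and regressing the price with a fully connected network. All dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

class MSIEPriceModel(nn.Module):
    def __init__(self, d_stat=16, d_text=64, d_spatial=32):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(d_stat + d_text + d_spatial, 64), nn.ReLU(),
            nn.Linear(64, 1),
        )
    def forward(self, stat_emb, text_emb, spatial_emb):
        # Multi-source rental representation = concatenation of the three views.
        rental = torch.cat([stat_emb, text_emb, spatial_emb], dim=-1)
        return self.mlp(rental)

model = MSIEPriceModel()
price = model(torch.randn(8, 16), torch.randn(8, 64), torch.randn(8, 32))
print(price.shape)  # (8, 1)
```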
Video-language pre-training has advanced the performance of various downstream video-language tasks. However, most previous methods directly inherit or adapt typical image-language pre-training paradigms to video-language pre-training, thus not fully exploiting the unique characteristic of video, i.e., temporality. In this paper, we propose a Hierarchical Temporal-Aware video-language pre-training framework, HiTeA, with two novel pre-training tasks for modeling cross-modal alignment between moments and texts as well as the temporal relations of video-text pairs. Specifically, we propose a cross-modal moment exploration task to explore moments in videos, which results in detailed video moment representations. Besides, the inherent temporal relations are captured by aligning video-text pairs as a whole at different time resolutions with a multi-modal temporal relation exploration task. Furthermore, we introduce the shuffling test to evaluate the temporal reliance of datasets and video-language pre-training models. We achieve state-of-the-art results on 15 well-established video-language understanding and generation tasks, especially on temporal-oriented datasets (e.g., SSv2-Template and SSv2-Label), with improvements of 8.6% and 11.1%, respectively. HiTeA also demonstrates strong generalization ability when directly transferred to downstream tasks in a zero-shot manner. Models and demo will be available on ModelScope.
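A minimal sketch of the shuffling test mentioned above: if a model (or dataset) truly relies on temporal order, its score should drop when frame order is randomly permuted. The `evaluate` callable is a hypothetical placeholder for any benchmark scoring function.

```python
import random

def shuffling_test(evaluate, videos):
    """Return (score on ordered frames, score on shuffled frames).

    videos: list of videos, each a list of frames in temporal order.
    A large gap between the two scores indicates strong temporal reliance.
    """
    ordered = evaluate(videos)
    shuffled = evaluate([random.sample(v, len(v)) for v in videos])
    return ordered, shuffled
```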
Neural operators, which emerge as implicit solution operators of hidden governing equations, have recently become popular tools for learning the responses of complex real-world physical systems. Nevertheless, the majority of neural operator applications have thus far been data-driven, neglecting the intrinsic preservation of fundamental physical laws in the data. In this paper, we introduce a novel integral neural operator architecture that learns physical models with fundamental conservation laws automatically guaranteed. In particular, by replacing the frame-dependent position information with its invariant counterpart in the kernel space, the proposed neural operator is by design translation- and rotation-invariant, and consequently abides by the conservation laws of linear and angular momentum. As applications, we demonstrate the expressivity and efficacy of our model in learning complex material behaviors from both synthetic and experimental datasets, and show that, by automatically satisfying these essential physical laws, our learned neural operator is not only generalizable to translated and rotated datasets but also achieves state-of-the-art accuracy and efficiency compared to baseline neural operator models.
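A minimal sketch of the invariance idea above: a kernel integral operator whose kernel sees only pairwise distances |x_i - x_j| (an invariant quantity) instead of absolute positions, making the learned map translation- and rotation-invariant by construction. The small kernel MLP is an assumed stand-in for the paper's architecture.

```python
import torch
import torch.nn as nn

class InvariantKernelOperator(nn.Module):
    def __init__(self, d_feat=8):
        super().__init__()
        # Kernel network takes only the scalar distance, never raw positions.
        self.kernel = nn.Sequential(nn.Linear(1, 16), nn.Tanh(),
                                    nn.Linear(16, d_feat * d_feat))
        self.d = d_feat
    def forward(self, u, x):
        # u: (N, d_feat) input field values, x: (N, dim) node positions.
        dist = torch.cdist(x, x).unsqueeze(-1)          # (N, N, 1), invariant
        k = self.kernel(dist).view(*dist.shape[:2], self.d, self.d)
        # Discretized kernel integral: v_i = (1/N) * sum_j K(|x_i - x_j|) u_j
        return torch.einsum("ijab,jb->ia", k, u) / u.shape[0]

op = InvariantKernelOperator()
u, x = torch.randn(10, 8), torch.randn(10, 2)
out1 = op(u, x)
out2 = op(u, x + 3.0)                                   # translated positions
print(torch.allclose(out1, out2, atol=1e-5))            # True: invariant output
```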
Face manipulation detection has been receiving much attention due to concerns over the reliability and security of face images. Recent studies focus on using auxiliary information or prior knowledge to capture robust manipulation traces, which has been shown to be promising. As one of the important face features, the face depth map, which has proven effective in other areas such as face recognition and face detection, has unfortunately received little attention in the literature on detecting manipulated face images. In this paper, we explore the possibility of incorporating the face depth map as auxiliary information to tackle the problem of face manipulation detection in real-world applications. To this end, we first propose a Face Depth Map Transformer (FDMT) to estimate the face depth map patch by patch from an RGB face image, which is able to capture the local depth anomalies created by manipulation. The estimated face depth map is then treated as auxiliary information to be integrated with the backbone features using a newly designed Multi-head Depth Attention (MDA) mechanism. Various experiments demonstrate the advantage of our proposed method for face manipulation detection.
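A minimal sketch of integrating depth features with backbone features via multi-head attention, in the spirit of the MDA mechanism described above. Treating the depth features as queries over RGB backbone features, built from the standard nn.MultiheadAttention, is an assumed simplification, not the paper's exact design.

```python
import torch
import torch.nn as nn

class DepthAttentionFusion(nn.Module):
    def __init__(self, d_model=64, n_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
    def forward(self, backbone_feat, depth_feat):
        # Both inputs: (B, num_patches, d_model). Depth queries the backbone so
        # that depth-anomalous patches can reweight the RGB features.
        fused, _ = self.attn(depth_feat, backbone_feat, backbone_feat)
        return backbone_feat + fused        # residual integration

fusion = DepthAttentionFusion()
out = fusion(torch.randn(2, 49, 64), torch.randn(2, 49, 64))
print(out.shape)  # (2, 49, 64)
```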