The dynamics of a turbulent flow tend to occupy only a portion of the phase space in a statistically stationary regime. From a dynamical systems point of view, this portion is the attractor. Knowledge of the turbulent attractor is useful for at least two purposes: (i) we can gain physical insight into turbulence (what is the shape and geometry of the attractor?), and (ii) it provides the minimal number of degrees of freedom needed to accurately describe the turbulent dynamics. Autoencoders enable the computation of an optimal latent space, which is a low-order representation of the dynamics. If properly trained and correctly designed, autoencoders can learn an approximation of the turbulent attractor, as shown by Doan, Racca and Magri (2022). In this paper, we theoretically interpret the transformations of an autoencoder. First, we remark that the latent space is a curved manifold with curvilinear coordinates, which can be analyzed with simple tools from Riemannian geometry. Second, we characterize the geometrical properties of the latent space. We mathematically derive the metric tensor, which provides a mathematical description of the manifold. Third, we propose a method -- proper latent decomposition (PLD) -- that generalizes proper orthogonal decomposition of turbulent flows to the autoencoder latent space. This decomposition finds the dominant directions in the curved latent space. This theoretical work opens up computational opportunities for interpreting autoencoders and creating reduced-order models of turbulent flows.
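To make the metric tensor mentioned in the abstract above concrete, here is a minimal sketch. It assumes the metric is the one induced by the decoder, g = J^T J with J the decoder Jacobian, which is the standard pull-back construction but is not spelled out in the abstract. The `decoder` below is a hypothetical stand-in (a sphere parameterization), not a trained autoencoder, and the function names are illustrative.

```python
import math

def decoder(z):
    # Hypothetical decoder: maps 2 latent coordinates (theta, phi)
    # to a point on the unit sphere in 3D. Stands in for a trained network.
    theta, phi = z
    return [math.sin(theta) * math.cos(phi),
            math.sin(theta) * math.sin(phi),
            math.cos(theta)]

def jacobian_columns(f, z, h=1e-6):
    # Central finite differences; cols[j][i] approximates d f_i / d z_j.
    n_out = len(f(z))
    cols = []
    for j in range(len(z)):
        zp, zm = list(z), list(z)
        zp[j] += h
        zm[j] -= h
        fp, fm = f(zp), f(zm)
        cols.append([(fp[i] - fm[i]) / (2 * h) for i in range(n_out)])
    return cols

def metric_tensor(f, z, h=1e-6):
    # Induced (pull-back) metric: g_jk = sum_i (d f_i / d z_j)(d f_i / d z_k).
    cols = jacobian_columns(f, z, h)
    return [[sum(a * b for a, b in zip(cj, ck)) for ck in cols] for cj in cols]

g = metric_tensor(decoder, [math.pi / 6, 0.3])
```

For this stand-in decoder the analytical metric is diag(1, sin^2(theta)), so the numerical `g` can be checked against a known answer; for a real autoencoder the Jacobian would typically come from automatic differentiation instead of finite differences.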
translated by Google Translate
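Vision Transformers (ViTs) have a fundamentally different architecture from convolutional neural networks, with considerably less inductive bias. Along with the improvement in performance, the security and robustness of ViTs are also of great importance. In contrast to many recent works that study the robustness of ViTs against adversarial examples, this paper investigates a representative causative attack, namely the backdoor. We first examine the vulnerability of ViTs to various backdoor attacks and find that ViTs are also quite vulnerable to existing attacks. However, we observe that the clean-data accuracy and the backdoor attack success rate of ViTs respond distinctively to patch transformations applied before the positional encoding. Based on this finding, we then propose an effective method for ViTs to defend against patch-based trigger backdoor attacks via patch processing. Performance is evaluated on several benchmark datasets, including CIFAR10, GTSRB, and TinyImageNet, and the results show that the proposed defense is very successful in mitigating backdoor attacks on ViTs. To the best of our knowledge, this paper presents the first defensive strategy that exploits a unique characteristic of ViTs against backdoor attacks.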
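Hardware-based acceleration is a widely pursued approach to speeding up many computationally intensive mathematical operations. This paper proposes an FPGA-based architecture to accelerate the convolution operation, a complex and expensive computing step that appears in many convolutional neural network models. We target the design at the standard convolution operation, intending to launch the product as an edge-AI solution. The goal of the project is to produce an FPGA IP core that can process one convolutional layer at a time. System developers can deploy the IP core using Verilog HDL, the primary design language of the architecture. Experimental results show that a single computing core synthesized on a simple edge-computing FPGA board can deliver 0.224 GOPS. When the board is fully utilized, 4.48 GOPS can be achieved.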
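For readers unfamiliar with the operation being accelerated in the abstract above, the following is a minimal software sketch of a standard 2D convolution layer kernel as used in CNNs (strictly a cross-correlation, with no padding and stride 1). It is a plain reference loop, not the authors' hardware design; the function name and the multiply-accumulate counter are illustrative assumptions.

```python
def conv2d_valid(image, kernel):
    # Naive "valid" 2D convolution in the CNN sense (cross-correlation):
    # slide the kernel over the image without padding, stride 1.
    # Also counts multiply-accumulate (MAC) operations to show the cost.
    H, W = len(image), len(image[0])
    kH, kW = len(kernel), len(kernel[0])
    out = []
    macs = 0
    for r in range(H - kH + 1):
        row = []
        for c in range(W - kW + 1):
            acc = 0
            for i in range(kH):
                for j in range(kW):
                    acc += image[r + i][c + j] * kernel[i][j]
                    macs += 1
            row.append(acc)
        out.append(row)
    return out, macs

image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
kernel = [[1, 0],
          [0, 1]]
result, mac_count = conv2d_valid(image, kernel)
# result is [[6, 8], [12, 14]] and mac_count is 16
```

The four nested loops are what a hardware core parallelizes. The two throughput figures quoted in the abstract differ by a factor of 20, which would be consistent with replicating the core about 20 times on the board, although the abstract does not state this.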
Physics-Informed Neural Networks (PINNs) have gained much attention in various fields of engineering thanks to their capability of incorporating physical laws into the models. PINNs integrate the physical constraints by minimizing the residuals of the partial differential equations (PDEs) on a set of collocation points. The distribution of these collocation points appears to have a large impact on the performance of PINNs, and the assessment of sampling methods for these points is still an active topic. In this paper, we propose a Fixed-Budget Online Adaptive Mesh Learning (FBOAML) method, which decomposes the domain into sub-domains and selects training collocation points based on local maxima and local minima of the PDE residuals. The stopping criterion is based on a reference data set, which leads to an adaptive number of iterations for each specific problem. The effectiveness of FBOAML is demonstrated on non-parameterized and parameterized problems. The impact of the hyper-parameters of FBOAML is investigated, and a comparison with other adaptive sampling methods is provided. The numerical results demonstrate important gains in accuracy for PINNs with FBOAML over classical PINNs with non-adaptive collocation points. We also apply FBOAML to a complex industrial application involving coupling between mechanical and thermal fields. We show that FBOAML is able to identify the high-gradient location and even gives better predictions for some physical fields than classical PINNs with collocation points taken on a pre-adapted finite element mesh.
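The idea of moving a fixed budget of collocation points toward high-residual regions, described in the abstract above, can be sketched in one dimension as follows. This is only an illustration under assumptions: the swap rule (replace the lowest-residual point in each sub-domain with the highest-residual random candidate), the function name `refine_collocation`, and all parameters are hypothetical and are not taken from the FBOAML paper.

```python
import math
import random

def refine_collocation(points, residual, n_sub, lo=0.0, hi=1.0, n_cand=50, seed=0):
    # Hypothetical fixed-budget refinement on the interval [lo, hi]:
    # split the domain into n_sub equal sub-domains and, in each one,
    # swap the existing point with the smallest |residual| for the random
    # candidate with the largest |residual|. The number of points is unchanged.
    # Points are assumed to lie inside [lo, hi].
    rng = random.Random(seed)
    width = (hi - lo) / n_sub
    new_points = []
    for k in range(n_sub):
        a, b = lo + k * width, lo + (k + 1) * width
        last = (k == n_sub - 1)
        local = [x for x in points if a <= x < b or (last and x == b)]
        if local:
            cands = [rng.uniform(a, b) for _ in range(n_cand)]
            best = max(cands, key=lambda x: abs(residual(x)))
            worst = min(local, key=lambda x: abs(residual(x)))
            if abs(residual(best)) > abs(residual(worst)):
                local[local.index(worst)] = best
        new_points.extend(local)
    return new_points

# Toy residual with a sharp peak at x = 0.5 (stands in for a PDE residual).
toy_residual = lambda x: math.exp(-100.0 * (x - 0.5) ** 2)
uniform_points = [i / 20 for i in range(20)]
adapted_points = refine_collocation(uniform_points, toy_residual, n_sub=4)
```

In a real PINN training loop the residual would be evaluated with the current network, and a step like this would be repeated between training iterations until a stopping criterion is met.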
We propose combining three pre-trained language models (XLM-R, BART, and DeBERTa-V3) to strengthen the contextualized embeddings used for named entity recognition. Our model achieves a 92.9% F1 score on the test set and ranks 5th on the leaderboard of NL4Opt competition subtask 1.
Video understanding is a growing field and a subject of intense research. It includes many interesting tasks that require understanding both spatial and temporal information, e.g., action detection, action recognition, video captioning, and video retrieval. One of the most challenging problems in video understanding is feature extraction, i.e., extracting a contextual visual representation from a given untrimmed video, which is difficult because of the long and complicated temporal structure of unconstrained videos. Unlike existing approaches, which apply a pre-trained backbone network as a black box to extract the visual representation, our approach aims to extract the most contextual information with an explainable mechanism. As we observed, humans typically perceive a video through the interactions between three main factors, i.e., the actors, the relevant objects, and the surrounding environment. Therefore, it is crucial to design a contextual, explainable video representation extraction that can capture each of these factors and model the relationships between them. In this paper, we discuss approaches that incorporate the human perception process into modeling actors, objects, and the environment. We choose video paragraph captioning and temporal action detection to illustrate the effectiveness of human perception-based contextual representation in video understanding. Source code is publicly available at https://github.com/UARK-AICV/Video_Representation.
Video anomaly detection (VAD) -- commonly formulated as a multiple-instance learning problem in a weakly-supervised manner due to its labor-intensive nature -- is a challenging problem in video surveillance where the anomalous frames need to be localized in an untrimmed video. In this paper, we first propose to utilize the ViT-encoded visual features from CLIP, in contrast with the conventional C3D or I3D features in the domain, to efficiently extract discriminative representations. We then model long- and short-range temporal dependencies and nominate the snippets of interest by leveraging our proposed Temporal Self-Attention (TSA). The ablation study conducted on each component confirms its effectiveness, and extensive experiments show that our proposed CLIP-TSA outperforms the existing state-of-the-art (SOTA) methods by a large margin on two commonly used VAD benchmark datasets (UCF-Crime and ShanghaiTech Campus). The source code will be made publicly available upon acceptance.
There is no settled universal 3D representation for geometry; alternatives include point clouds, meshes, implicit functions, and voxels, to name a few. In this work, we present a new, compelling alternative for representing shapes using a sequence of cross-sectional closed loops. The loops across all planes form an organizational hierarchy, which we leverage for autoregressive shape synthesis and editing. Loops are a non-local description of the underlying shape, as simple loop manipulations (such as shifts) result in significant structural changes to the geometry. This is in contrast to manipulating local primitives such as points in a point cloud or a triangle in a triangle mesh. We further demonstrate that loops are an intuitive and natural primitive for analyzing and editing shapes, both computationally and for users.
The biomedical imaging world is notorious for working with small amounts of data, which frustrates state-of-the-art efforts in computer vision and deep learning. With large datasets it is easier to make progress, as we have seen with natural images. The same holds for microscopy videos of neuron cells moving in a culture. This problem presents several challenges, as it can be difficult to grow and maintain the culture for days, and it is expensive to acquire the materials and equipment. In this work, we explore how to alleviate this data scarcity problem by synthesizing the videos. To this end, we use a recent video diffusion model to synthesize videos of cells from our training dataset. We then analyze the model's strengths and consistent shortcomings to guide us in making video generation as high-quality as possible. To improve on this task, we propose modifying the denoising function and adding motion information (dense optical flow) so that the model has more context regarding how video frames transition over time and how each pixel changes over time.
Adversarial machine learning has been both a major concern and a hot topic recently, especially with the ubiquitous use of deep neural networks in the current landscape. Adversarial attacks and defenses are usually likened to a cat-and-mouse game in which defenders and attackers evolve over time. On the one hand, the goal is to develop strong and robust deep networks that are resistant to malicious actors. On the other hand, in order to achieve that, we need to devise even stronger adversarial attacks to challenge these defense models. Most existing attacks employ a single $\ell_p$ distance (commonly, $p\in\{1,2,\infty\}$) to define the concept of closeness and perform steepest gradient ascent w.r.t. this $p$-norm to update all pixels in an adversarial example in the same way. These $\ell_p$ attacks each have their own pros and cons, and there is no single attack that can successfully break through defense models that are robust against multiple $\ell_p$ norms simultaneously. Motivated by these observations, we come up with a natural approach: combining various $\ell_p$ gradient projections on a pixel level to achieve a joint adversarial perturbation. Specifically, we learn how to perturb each pixel to maximize the attack performance while maintaining the overall visual imperceptibility of adversarial examples. Finally, through various experiments with standardized benchmarks, we show that our method outperforms most current strong attacks across state-of-the-art defense mechanisms, while keeping the adversarial examples visually clean.
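As a small illustration of the steepest-ascent updates the abstract above refers to, the sketch below computes the standard ascent direction for $p \in \{1, 2, \infty\}$ on a flattened gradient. The `blend_directions` helper is a hypothetical fixed-weight per-pixel mix added only to show what "combining on a pixel level" could look like; the paper learns how to perturb each pixel, and its actual scheme is not given in the abstract.

```python
import math

def sign(x):
    # Sign of x as a float: -1.0, 0.0 or 1.0.
    return float((x > 0) - (x < 0))

def steepest_ascent_direction(grad, p):
    # Standard steepest-ascent direction under an l_p constraint.
    if p == math.inf:
        # l_inf: every coordinate moves by the sign of its gradient.
        return [sign(g) for g in grad]
    if p == 2:
        # l_2: gradient rescaled to unit Euclidean norm.
        norm = math.sqrt(sum(g * g for g in grad))
        return [g / norm if norm > 0 else 0.0 for g in grad]
    if p == 1:
        # l_1: all of the budget goes to the largest-magnitude coordinate.
        top = max(range(len(grad)), key=lambda k: abs(grad[k]))
        return [sign(g) if k == top else 0.0 for k, g in enumerate(grad)]
    raise ValueError("p must be 1, 2 or math.inf")

def blend_directions(grad, weights):
    # Hypothetical per-pixel blend: weights[k] = (w1, w2, winf) for pixel k.
    d1 = steepest_ascent_direction(grad, 1)
    d2 = steepest_ascent_direction(grad, 2)
    dinf = steepest_ascent_direction(grad, math.inf)
    return [w[0] * a + w[1] * b + w[2] * c
            for w, a, b, c in zip(weights, d1, d2, dinf)]

grad = [3.0, -4.0]
d_inf = steepest_ascent_direction(grad, math.inf)  # [1.0, -1.0]
d_one = steepest_ascent_direction(grad, 1)         # [0.0, -1.0]
```

The three directions differ sharply in how they spread the perturbation across pixels (dense for $\ell_\infty$, sparse for $\ell_1$), which is the trade-off that motivates mixing them.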