Image super-resolution is a common task on mobile and IoT devices, where one often needs to upscale and enhance low-resolution images and video frames. While numerous solutions have been proposed for this problem in the past, they are usually not compatible with low-power mobile NPUs, which have many computational and memory constraints. In this Mobile AI challenge, we address this problem and ask the participants to design an efficient quantized image super-resolution solution that can demonstrate real-time performance on mobile NPUs. The participants were provided with the DIV2K dataset and trained INT8 models to do high-quality 3X image upscaling. The runtime of all models was evaluated on the Synaptics VS680 Smart Home board with a dedicated edge NPU capable of accelerating quantized neural networks. All proposed solutions are fully compatible with the above NPU, demonstrating a rate of up to 60 FPS when reconstructing Full HD resolution images. A detailed description of all models developed in the challenge is provided in this paper.
translated by Google Translate
Higher-order structures (cavities and cliques) of the gene network of influenza A virus reveal tight associations among viruses during evolution and are key signals that indicate viral cross-species infection and the onset of pandemics. As indicators for sensing the dynamic changes of viral genes, these higher-order structures have been the focus of attention in the field of virology. However, the viral gene network is usually huge, and searching for these structures in the network introduces unacceptable delay. To mitigate this issue, in this paper, we propose a simple yet effective deep-learning model named HyperSearch to search for cavities in a computable complex network for influenza virus genetics. Extensive experiments conducted on a public influenza virus dataset demonstrate the effectiveness of HyperSearch over other advanced deep-learning methods without any elaborate model crafting. Moreover, HyperSearch can finish the search in minutes, while 0-1 programming takes days. Since the proposed method is simple and easily transferable to other complex networks, HyperSearch has the potential to facilitate the monitoring of dynamic changes in viral genes and help humans keep up with the pace of virus mutations.
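The Guzheng is a traditional Chinese instrument with diverse playing techniques. Instrument playing techniques (IPT) play an important role in musical performance. However, most existing works on IPT detection are inefficient for variable-length audio and offer no guarantee of generalization, because they rely on a single sound bank for training and testing. In this study, we propose an end-to-end Guzheng playing technique detection system using a fully convolutional network that can be applied to variable-length audio. Because each Guzheng playing technique is applied to a note, a dedicated onset detector is trained to divide the audio into notes, and its predictions are fused with the frame-wise IPT predictions. During fusion, we add the IPT predictions frame by frame inside each note and take the IPT with the highest probability within each note as the final output for that note. We create a new dataset named GZ_ISOTECH from multiple sound banks and real-world recordings for Guzheng performance analysis. Our approach achieves 87.97% frame-level accuracy and an 80.76% note-level F1 score, outperforming existing works, which indicates the effectiveness of our proposed method in IPT detection.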
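The note-level fusion step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `fuse_note_predictions`, the representation of onsets as frame indices, and the toy probabilities are all hypothetical.

```python
from typing import List, Sequence


def fuse_note_predictions(
    frame_probs: Sequence[Sequence[float]],
    onset_frames: Sequence[int],
) -> List[int]:
    """Return one technique index per note.

    frame_probs[t][k] is the (hypothetical) frame-wise probability of
    technique k at frame t; onset_frames holds the first frame of each note.
    Frames before the first onset are ignored.
    """
    num_frames = len(frame_probs)
    # Each note spans from its onset to the next onset (or the end of the audio).
    boundaries = sorted(onset_frames) + [num_frames]
    note_labels = []
    for start, end in zip(boundaries[:-1], boundaries[1:]):
        if start >= end:
            continue  # skip empty notes (e.g. duplicated onsets)
        num_classes = len(frame_probs[start])
        # Add the frame-wise predictions frame by frame inside the note.
        summed = [0.0] * num_classes
        for t in range(start, end):
            for k in range(num_classes):
                summed[k] += frame_probs[t][k]
        # The technique with the highest accumulated probability wins.
        note_labels.append(max(range(num_classes), key=summed.__getitem__))
    return note_labels


# Toy example: 5 frames, 3 techniques, notes starting at frames 0 and 3.
probs = [
    [0.6, 0.3, 0.1],
    [0.2, 0.7, 0.1],
    [0.5, 0.4, 0.1],
    [0.1, 0.2, 0.7],
    [0.3, 0.3, 0.4],
]
labels = fuse_note_predictions(probs, onset_frames=[0, 3])
```

In the toy example, technique 0 wins two of the three frames in the first note, yet technique 1 has the larger accumulated probability, so summing before taking the maximum gives a different result than a per-frame majority vote would.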
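Federated learning frameworks typically require collaborators to share their local gradient updates of a common model, rather than their training data, in order to preserve privacy. However, prior work on gradient leakage attacks has shown that private training data can be revealed from gradients. So far, almost all related works base their attacks on fully connected or convolutional neural networks. Given the recent overwhelming trend of adapting Transformers to solve a wide variety of vision tasks, it is highly valuable to investigate the privacy risk of vision Transformers. In this paper, we analyze the gradient leakage risk of the self-attention mechanism in both theoretical and practical manners. In particular, we propose APRIL - Attention PRIvacy Leakage, which poses a strong threat to self-attention-based models such as ViT. By showing how vision Transformers are at risk of privacy leakage via gradients, we urge the design of privacy-safer Transformer models and defense schemes.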
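Multi-view detection incorporates multiple camera views to alleviate occlusion in crowded scenes, and state-of-the-art approaches adopt homography transformations to project multi-view features onto the ground plane. However, we find that these 2D transformations do not take the object's height into account, and because of this neglect, features along the vertical direction of the same object are likely not projected onto the same ground-plane point, leading to impure ground-plane features. To solve this problem, we propose VFA, voxelized 3D feature aggregation, for feature transformation and aggregation in multi-view detection. Specifically, we voxelize the 3D space, project the voxels onto each camera view, and associate 2D features with these projected voxels. This allows us to identify and then aggregate 2D features along the same vertical line, alleviating projection distortions to a large extent. In addition, because different kinds of objects (human vs. cattle) have different shapes on the ground plane, we introduce oriented Gaussian encoding to match such shapes, leading to increased accuracy and efficiency. We perform experiments on multi-view 2D detection and multi-view 3D detection problems. Results on four datasets (including a newly introduced MultiviewC dataset) show that our system is very competitive compared with state-of-the-art approaches. Code and MultiviewC are released at https://github.com/robert-mar/vfa.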
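In recent years, the research focus of scene text detection and recognition has shifted to arbitrary-shape text, where text shape representation is a fundamental problem. In our opinion, an ideal representation should be compact, complete, efficient, and reusable for subsequent recognition. However, previous representations have flaws in one or more of these aspects. The Thin-Plate-Spline (TPS) transformation has achieved great success in scene text recognition. Inspired by this, we reverse its usage and take TPS as a fine-grained representation for arbitrary-shape text. The TPS representation is compact, complete, and efficient. With the predicted TPS parameters, the detected text region can be directly rectified to a near-horizontal one to assist subsequent recognition. To further exploit the potential of the TPS representation, a Border Alignment Loss is proposed. Based on these designs, we implement the text detector TPSNet, which can be conveniently extended to a text spotter. Extensive evaluation and ablation on several public benchmarks demonstrate the effectiveness and superiority of the proposed method for text representation and spotting. In particular, TPSNet achieves a detection F-measure improvement of 4.4% (78.4% vs. 74.0%) on the ART dataset and an end-to-end spotting F-measure improvement of 5.0% (78.5% vs. 73.5%) on Total-Text, which are large margins achieved with no bells and whistles.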
Temporal sentence grounding (TSG) aims to identify the temporal boundary of a specific segment in an untrimmed video given a sentence query. All existing works first use a sparse sampling strategy to extract a fixed number of video frames and then conduct multi-modal interactions with the query sentence for reasoning. However, we argue that these methods have overlooked two important issues: 1) Boundary-bias: The annotated target segment generally refers to two specific frames as the corresponding start and end timestamps. The video downsampling process may lose these two frames and take adjacent irrelevant frames as the new boundaries. 2) Reasoning-bias: Such incorrect new boundary frames also lead to reasoning bias during frame-query interaction, reducing the generalization ability of the model. To alleviate the above limitations, in this paper, we propose a novel Siamese Sampling and Reasoning Network (SSRN) for TSG, which introduces a siamese sampling mechanism to generate additional contextual frames to enrich and refine the new boundaries. Specifically, a reasoning strategy is developed to learn the inter-relationship among these frames and generate soft labels on boundaries for more accurate frame-query reasoning. Such a mechanism is also able to supplement the sampled sparse frames with the absent consecutive visual semantics for fine-grained activity understanding. Extensive experiments demonstrate the effectiveness of SSRN on three challenging datasets.
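The boundary-bias issue can be illustrated with a small sketch. It assumes uniform sampling at chunk centres and nearest-frame remapping of the annotated boundaries, both chosen only for illustration; the abstract does not specify the sampling scheme, and the function names and frame numbers are hypothetical.

```python
from typing import List, Tuple


def uniform_sample(num_frames: int, num_samples: int) -> List[int]:
    """Pick num_samples frame indices spread evenly over the video."""
    step = num_frames / num_samples
    # Take the centre of each of the num_samples equal-length chunks.
    return [int(step * i + step / 2) for i in range(num_samples)]


def shifted_boundaries(sampled: List[int], start: int, end: int) -> Tuple[int, int]:
    """Map annotated boundary frames to the nearest sampled frames."""

    def nearest(frame: int) -> int:
        return min(sampled, key=lambda s: abs(s - frame))

    return nearest(start), nearest(end)


# A 300-frame video reduced to 10 frames; the annotated segment is [97, 203].
sampled = uniform_sample(num_frames=300, num_samples=10)
new_start, new_end = shifted_boundaries(sampled, start=97, end=203)
```

Here neither annotated frame survives sampling, so the segment boundaries move to frames 105 and 195, each 8 frames away from the annotation. This is the kind of shift that additional contextual frames around the sampled boundaries are meant to compensate for.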
Robust prediction of citywide traffic flows at different time periods plays a crucial role in intelligent transportation systems. While previous work has made great efforts to model spatio-temporal correlations, existing methods still suffer from two key limitations: i) Most models collectively predict all regions' flows without accounting for spatial heterogeneity, i.e., different regions may have skewed traffic flow distributions. ii) These models fail to capture the temporal heterogeneity induced by time-varying traffic patterns, as they typically model temporal correlations with a shared parameterized space for all time periods. To tackle these challenges, we propose a novel Spatio-Temporal Self-Supervised Learning (ST-SSL) traffic prediction framework which enhances the traffic pattern representations to be reflective of both spatial and temporal heterogeneity, with auxiliary self-supervised learning paradigms. Specifically, our ST-SSL is built over an integrated module with temporal and spatial convolutions for encoding the information across space and time. To achieve the adaptive spatio-temporal self-supervised learning, our ST-SSL first performs the adaptive augmentation over the traffic flow graph data at both attribute- and structure-levels. On top of the augmented traffic graph, two SSL auxiliary tasks are constructed to supplement the main traffic prediction task with spatial and temporal heterogeneity-aware augmentation. Experiments on four benchmark datasets demonstrate that ST-SSL consistently outperforms various state-of-the-art baselines. Since spatio-temporal heterogeneity widely exists in practical datasets, the proposed framework may also cast light on other spatial-temporal applications. Model implementation is available at https://github.com/Echo-Ji/ST-SSL.
Despite the remarkable progress of image captioning, existing captioners typically lack the controllable capability to generate desired image captions, e.g., describing the image in a rough or detailed manner, in a factual or emotional view, etc. In this paper, we show that a unified model is qualified to perform well in diverse domains and freely switch among multiple styles. Such a controllable capability is achieved by embedding the prompt learning into the image captioning framework. To be specific, we design a set of prompts to fine-tune the pre-trained image captioner. These prompts allow the model to absorb stylized data from different domains for joint training, without performance degradation in each domain. Furthermore, we optimize the prompts with learnable vectors in the continuous word embedding space, avoiding the heuristic prompt engineering and meanwhile exhibiting superior performance. In the inference stage, our model is able to generate desired stylized captions by choosing the corresponding prompts. Extensive experiments verify the controllable capability of the proposed method. Notably, we achieve outstanding performance on two diverse image captioning benchmarks including COCO Karpathy split and TextCaps using a unified model.
Motivation: Enhancers are important cis-regulatory elements that regulate a wide range of biological functions and enhance the transcription of target genes. Although many state-of-the-art computational methods have been proposed to efficiently identify enhancers, learning globally contextual features is still one of the challenges for computational methods. Given the similarities between biological sequences and natural language sentences, novel BERT-based language techniques have been applied to extract complex contextual features in various computational biology tasks such as protein function/structure prediction. To speed up research on enhancer identification, it is urgent to construct a BERT-based enhancer language model. Results: In this paper, we propose a multi-scale enhancer identification method (iEnhancer-ELM) based on enhancer language models, which treat enhancer sequences as natural language sentences composed of k-mer nucleotides. iEnhancer-ELM can extract contextual information of multi-scale k-mers with positions from raw enhancer sequences. Benefiting from the complementary information of multi-scale k-mers, we ensemble four iEnhancer-ELM models to improve enhancer identification. The benchmark comparisons show that our model outperforms state-of-the-art methods. Using the interpretable attention mechanism, we find 30 biological patterns, of which 40% (12/30) are verified by a widely used motif tool (STREME) and a popular dataset (JASPAR), demonstrating that our model has the potential to reveal the biological mechanism of enhancers. Availability: The source code is available at https://github.com/chen-bioinfo/iEnhancer-ELM Contact: junjiechen@hit.edu.cn and junjie.chen.hit@gmail.com; Supplementary information: Supplementary data are available at Bioinformatics online.