Graphic layout designs play an essential role in visual communication. Yet handcrafting layout designs is skill-demanding, time-consuming, and non-scalable to batch production. Although generative models have emerged to make design automation no longer utopian, it remains non-trivial to customize designs that comply with designers' multimodal desires, i.e., constrained by background images and driven by foreground contents. In this study, we propose \textit{LayoutDETR}, which inherits the high quality and realism of generative modeling while reformulating content-aware requirements as a detection problem: we learn to detect, in a background image, the reasonable locations, scales, and spatial relations for the multimodal elements of a layout. Experiments validate that our solution yields new state-of-the-art performance for layout generation on public benchmarks and on our newly curated ads banner dataset. For practical usage, we build our solution into a graphical system that facilitates user studies. We demonstrate that our designs attract more subjective preference than baselines by significant margins. Our code, models, dataset, graphical system, and demos are available at https://github.com/salesforce/LayoutDETR.
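To make the detection-style formulation concrete, here is a minimal, hypothetical sketch (module sizes and names are assumptions, not the released LayoutDETR code): a set of learned layout queries cross-attends to background-image features, and each query is decoded into a normalized bounding box for one foreground element.

```python
# Minimal, hypothetical sketch of a DETR-style layout decoder (not the released LayoutDETR code).
import torch
import torch.nn as nn

class LayoutDecoderSketch(nn.Module):
    def __init__(self, d_model=256, num_queries=8, num_layers=4):
        super().__init__()
        self.queries = nn.Embedding(num_queries, d_model)           # one query per layout element
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=num_layers)
        self.box_head = nn.Linear(d_model, 4)                        # (cx, cy, w, h) in [0, 1]

    def forward(self, bg_features):
        # bg_features: (batch, num_patches, d_model), e.g. flattened features of the background image
        q = self.queries.weight.unsqueeze(0).expand(bg_features.size(0), -1, -1)
        decoded = self.decoder(q, bg_features)                       # queries attend to the background
        return torch.sigmoid(self.box_head(decoded))                 # normalized box per element

boxes = LayoutDecoderSketch()(torch.randn(2, 196, 256))              # -> (2, 8, 4)
```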
Recent years have witnessed the rapid progress of image captioning. However, the demands for large memory storage and heavy computational burden prevent these captioning models from being deployed on mobile devices. The main obstacles lie in the heavyweight visual feature extractors (i.e., object detectors) and complicated cross-modal fusion networks. To this end, we propose LightCap, a lightweight image captioner for resource-limited devices. The core design is built on the recent CLIP model for efficient image captioning. To be specific, on the one hand, we leverage the CLIP model to extract compact grid features without relying on time-consuming object detectors. On the other hand, we transfer the image-text retrieval design of CLIP to image captioning scenarios by devising a novel visual concept extractor and a cross-modal modulator. We further optimize the cross-modal fusion model and parallel prediction heads via sequential and ensemble distillations. With the carefully designed architecture, our model contains merely 40M parameters, reducing the model size by more than 75% and the FLOPs by more than 98% in comparison with the current state-of-the-art methods. Despite its low capacity, our model still exhibits state-of-the-art performance on prevalent datasets, e.g., 136.6 CIDEr on the COCO Karpathy test split. Tested on a smartphone with only a single CPU, the proposed LightCap achieves a fast inference speed of 188 ms per image, making it ready for practical applications.
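A rough picture of this kind of pipeline, as a minimal sketch rather than the LightCap implementation (feature shapes and module sizes below are assumptions): frozen CLIP grid features are projected and fused with caption tokens by a small cross-modal decoder topped with a prediction head.

```python
# Minimal, hypothetical sketch of a lightweight captioner on top of frozen CLIP grid features
# (not the LightCap implementation). `clip_grid_features` stands in for the output of a frozen
# CLIP image encoder, e.g. a (batch, 49, 512) patch-feature map.
import torch
import torch.nn as nn

class TinyCaptionerSketch(nn.Module):
    def __init__(self, vocab_size=30522, d_model=384, num_layers=4):
        super().__init__()
        self.visual_proj = nn.Linear(512, d_model)                   # project CLIP grid features
        self.token_emb = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead=6, batch_first=True)
        self.fusion = nn.TransformerDecoder(layer, num_layers=num_layers)  # cross-modal fusion
        self.lm_head = nn.Linear(d_model, vocab_size)                # prediction head

    def forward(self, clip_grid_features, caption_tokens):
        mem = self.visual_proj(clip_grid_features)                   # (B, 49, d_model)
        tgt = self.token_emb(caption_tokens)                         # (B, T, d_model)
        fused = self.fusion(tgt, mem)                                # text attends to grid features
        return self.lm_head(fused)                                   # next-token logits

logits = TinyCaptionerSketch()(torch.randn(2, 49, 512), torch.randint(0, 30522, (2, 12)))
```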
The number of international benchmarking competitions is steadily increasing in various fields of machine learning (ML) research and practice. So far, however, little is known about the common practice as well as the bottlenecks faced by the community in tackling the research questions posed. To shed light on the status quo of algorithm development in the specific field of biomedical image analysis, we designed an international survey that was issued to all participants of challenges conducted in conjunction with the IEEE ISBI 2021 and MICCAI 2021 conferences (80 competitions in total). The survey covered participants' expertise and working environments, their chosen strategies, as well as algorithm characteristics. A median of 72% of challenge participants took part in the survey. According to our results, knowledge exchange was the primary incentive (70%) for participation, while the reception of prize money played only a minor role (16%). While a median of 80 working hours was spent on method development, a large portion of participants stated that they did not have enough time for method development (32%). 25% perceived the infrastructure to be a bottleneck. Overall, 94% of all solutions were deep learning-based. Of these, 84% were based on standard architectures. 43% of the respondents reported that the data samples (e.g., images) were too large to be processed at once. This was most commonly addressed by patch-based training (69%), downsampling (37%), and solving 3D analysis tasks as a series of 2D tasks. K-fold cross-validation on the training set was performed by only 37% of the participants, and only 50% performed ensembling, based on either multiple identical models (61%) or heterogeneous models (39%). 48% of the respondents applied postprocessing steps.
Despite the remarkable progress of image captioning, existing captioners typically lack the controllable capability to generate desired image captions, e.g., describing the image in a rough or detailed manner, from a factual or emotional viewpoint, etc. In this paper, we show that a unified model is qualified to perform well in diverse domains and to switch freely among multiple styles. Such controllability is achieved by embedding prompt learning into the image captioning framework. To be specific, we design a set of prompts to fine-tune the pre-trained image captioner. These prompts allow the model to absorb stylized data from different domains for joint training, without performance degradation in each domain. Furthermore, we optimize the prompts with learnable vectors in the continuous word embedding space, avoiding heuristic prompt engineering while exhibiting superior performance. In the inference stage, our model is able to generate desired stylized captions by choosing the corresponding prompts. Extensive experiments verify the controllable capability of the proposed method. Notably, we achieve outstanding performance on two diverse image captioning benchmarks, the COCO Karpathy split and TextCaps, using a unified model.
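The continuous-prompt idea can be sketched as follows; this is a generic prompt-tuning illustration with assumed dimensions, not the paper's exact module: one learnable prompt per style is prepended to the caption-token embeddings before they enter the caption decoder.

```python
# Minimal, hypothetical sketch of continuous (learnable) style prompts for a captioner.
import torch
import torch.nn as nn

class StylePromptsSketch(nn.Module):
    def __init__(self, num_styles=4, prompt_len=8, d_model=512):
        super().__init__()
        # learnable prompt vectors live directly in the word-embedding space
        self.prompts = nn.Parameter(torch.randn(num_styles, prompt_len, d_model) * 0.02)

    def forward(self, token_embeddings, style_id):
        # token_embeddings: (batch, T, d_model); style_id selects e.g. "factual" vs. "emotional"
        prompt = self.prompts[style_id].unsqueeze(0).expand(token_embeddings.size(0), -1, -1)
        return torch.cat([prompt, token_embeddings], dim=1)          # (batch, prompt_len + T, d_model)

prompted = StylePromptsSketch()(torch.randn(2, 12, 512), style_id=1)
```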
Video super-resolution is one of the most popular tasks on mobile devices, widely used for the automatic enhancement of low-bitrate and low-resolution video streams. While numerous solutions have been proposed for this problem, they are usually quite computationally demanding, demonstrating low FPS rates and poor power efficiency on mobile devices. In this Mobile AI challenge, we address this problem by tasking the participants with designing an end-to-end real-time video super-resolution solution for mobile NPUs optimized for low energy consumption. The participants were provided with the REDS training dataset containing video sequences for a 4X video upscaling task. The runtime and power efficiency of all models were evaluated on the powerful MediaTek Dimensity 9000 platform with a dedicated AI processing unit capable of accelerating floating-point and quantized neural networks. All proposed solutions are fully compatible with the above NPU, demonstrating frame rates of up to 500 FPS and a power consumption of 0.2 [Watt / 30 FPS]. A detailed description of all models developed in the challenge is provided in this paper.
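For illustration only (not any specific submitted solution), a mobile-friendly 4x upscaler of the kind such challenges target typically stacks a few small convolutions followed by a pixel-shuffle upsampler, which quantizes well and maps efficiently to NPUs.

```python
# Illustrative, hypothetical sketch of a tiny 4x upscaling network; all sizes are assumptions.
import torch
import torch.nn as nn

class TinyVSRSketch(nn.Module):
    def __init__(self, channels=16, scale=4):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(3, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, 3 * scale * scale, 3, padding=1),
        )
        self.upsample = nn.PixelShuffle(scale)                       # rearranges channels into space

    def forward(self, lr_frame):                                     # (B, 3, H, W) low-res frame
        return self.upsample(self.body(lr_frame))                    # (B, 3, 4H, 4W)

sr = TinyVSRSketch()(torch.randn(1, 3, 180, 320))                    # -> (1, 3, 720, 1280)
```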
Funding agencies rely heavily on topic matching between domain experts and research proposals to assign proposal reviewers. As proposals become increasingly interdisciplinary, it is challenging to profile the interdisciplinary nature of a proposal and, thereafter, to find expert reviewers with the appropriate expertise. An essential step in addressing this challenge is to accurately classify the interdisciplinary labels of a proposal. Existing methodological and application-related literature, such as text classification and proposal classification, is insufficient to jointly address the three key unique issues introduced by interdisciplinary proposal data: 1) the hierarchical structure of a proposal's discipline labels, from coarse to fine grain, e.g., from information science to AI to the fundamentals of AI; 2) the heterogeneous semantics of the main textual sections, which play different roles in a proposal; 3) the imbalance in the number of proposals between non-interdisciplinary and interdisciplinary research. Can we address these three issues simultaneously when profiling the interdisciplinary nature of a proposal? To answer this question, we propose a hierarchical Mixup multi-label classification framework, which we call H-Mixup. H-Mixup leverages a Transformer-based semantic information extractor and a GCN-based interdisciplinary knowledge extractor to address the first and second issues. H-Mixup develops a fused training method of word-level Mixup, word-level CutMix, manifold Mixup, and document-level Mixup to address the third issue.
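Document-level Mixup, one of the fused training components, can be sketched generically as interpolating document embeddings and their multi-hot label vectors; this is an illustration of the standard technique with assumed shapes, not the H-Mixup implementation.

```python
# Generic document-level Mixup sketch for multi-label classification (illustrative only).
import torch

def document_level_mixup(doc_embeddings, labels, alpha=0.2):
    # doc_embeddings: (batch, d); labels: (batch, num_classes) multi-hot
    lam = torch.distributions.Beta(alpha, alpha).sample()            # interpolation weight
    perm = torch.randperm(doc_embeddings.size(0))                    # random pairing within the batch
    mixed_x = lam * doc_embeddings + (1 - lam) * doc_embeddings[perm]
    mixed_y = lam * labels + (1 - lam) * labels[perm]
    return mixed_x, mixed_y

x, y = document_level_mixup(torch.randn(8, 768), torch.randint(0, 2, (8, 20)).float())
```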
Despite the great progress in Visual Question Answering (VQA), current VQA models rely heavily on superficial correlations between question types and their corresponding frequent answers (i.e., language priors) to make predictions, without truly understanding the input. In this work, we define training instances with the same question type but different answers as \textit{superficially similar instances}, and attribute the language priors to the confusion of VQA models on such instances. To address this problem, we propose a novel training framework that explicitly encourages the VQA model to distinguish between superficially similar instances. Specifically, for each training instance, we first construct a set containing its superficially similar counterparts. Then we exploit the proposed distinguishing module to increase the distance between the instance and its counterparts in the answer space. In this way, the VQA model is forced to further focus on parts of the input beyond the question type, which helps it overcome the language priors. Experimental results show that our method achieves state-of-the-art performance on VQA-CP v2. The code is available at \href{https://github.com/wyk-nku/distinguishing-vqa.git}{Distinguishing-VQA}.
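The distance-enlarging idea can be sketched with a generic hinge-style objective in the answer space; the function below is an illustrative stand-in with assumed shapes, not the paper's exact distinguishing module.

```python
# Hypothetical sketch of a "distinguishing" objective: push an instance's answer-space
# representation away from those of its superficially similar counterparts.
import torch
import torch.nn.functional as F

def distinguishing_loss(anchor_logits, counterpart_logits, margin=1.0):
    # anchor_logits: (num_answers,); counterpart_logits: (K, num_answers)
    anchor = F.normalize(anchor_logits, dim=-1).unsqueeze(0)
    others = F.normalize(counterpart_logits, dim=-1)
    dist = torch.cdist(anchor, others).squeeze(0)                    # distances in answer space
    return F.relu(margin - dist).mean()                              # hinge: enlarge the distance

loss = distinguishing_loss(torch.randn(3129), torch.randn(5, 3129))
```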
Binary code similarity detection (BCSD) methods measure the similarity between two binary executable codes. Recently, learning-based BCSD methods have achieved great success, outperforming traditional BCSD in both detection accuracy and efficiency. However, existing studies on the adversarial vulnerability of learning-based BCSD methods are rather sparse, which poses risks to security-related applications. To evaluate adversarial robustness, this paper designs an efficient and black-box adversarial code generation algorithm, namely FuncFooler. FuncFooler constrains the adversarial code to 1) keep the program's control flow graph (CFG) unchanged and 2) preserve the same semantic meaning. Specifically, FuncFooler iteratively 1) determines vulnerable candidates in the malicious code, 2) selects and inserts adversarial instructions from benign code, and 3) corrects the semantic side effects of the adversarial code to satisfy the constraints. Empirically, FuncFooler can successfully attack three learning-based BCSD models, including SAFE, Asm2Vec, and jTrans, which calls into question whether learning-based BCSD is desirable.
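The iterative insert-and-repair loop can be pictured with a toy, self-contained sketch; everything below (the list-of-instructions function model, the probe-based candidate selection, and the Jaccard similarity) is a hypothetical stand-in, not the authors' implementation.

```python
# Toy, hypothetical sketch of an insertion-based adversarial loop in the spirit of FuncFooler.
import random

def attack_function(target, reference, similarity, benign_pool, max_iters=50, threshold=0.5):
    adv = list(target)
    for _ in range(max_iters):
        if similarity(adv, reference) < threshold:                   # similarity dropped: success
            break
        # 1) pick an insertion point: here, the position where a NOP probe lowers similarity most
        #    -- a crude stand-in for the paper's vulnerable-candidate criterion
        probes = [similarity(adv[:i] + ["nop"] + adv[i:], reference) for i in range(len(adv) + 1)]
        pos = min(range(len(probes)), key=probes.__getitem__)
        # 2) insert an instruction drawn from benign code at that position
        adv.insert(pos, random.choice(benign_pool))
        # 3) a real attack would now repair semantic side effects (saved registers, flags)
        #    so that the CFG and semantics are preserved; omitted in this toy sketch
    return adv

# Toy usage with a dummy similarity measure (Jaccard over instruction strings).
def toy_similarity(a, b):
    sa, sb = set(a), set(b)
    return len(sa & sb) / max(1, len(sa | sb))

adv = attack_function(["push rbp", "mov rbp, rsp", "ret"], ["push rbp", "ret"],
                      toy_similarity, benign_pool=["xor eax, eax", "lea rax, [rax]"])
```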
One-shot voice conversion (VC), in which only a single utterance from the target speaker is available for reference, has become a hot research topic. Existing works typically disentangle timbre, while information about pitch, rhythm, and content remains mixed together. To further disentangle these speech components and perform one-shot VC effectively, we employ random resampling for the pitch and content encoders, and use the variational contrastive log-ratio upper bound of mutual information together with gradient-reversal-layer-based adversarial mutual information learning to ensure that the latent space of each part contains only the desired disentangled representation during training. Experiments on the VCTK dataset show that the model achieves state-of-the-art performance for one-shot VC in terms of naturalness and intelligibility. In addition, through speech representation disentanglement, we can separately transfer timbre, pitch, and rhythm in one-shot VC. Our code, pre-trained models, and demos are available at https://im1eon.github.io/is2022-Srdvc/.
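The gradient reversal layer mentioned above is a standard building block; a minimal sketch of the generic pattern (not the paper's full training objective) follows.

```python
# Minimal sketch of a gradient reversal layer (GRL): identity forward, negated gradient backward.
import torch

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # identity on the forward pass, negated (scaled) gradient on the backward pass
        return -ctx.lam * grad_output, None

def grad_reverse(x, lam=1.0):
    return GradReverse.apply(x, lam)

# Usage: features passed through grad_reverse before an auxiliary predictor are trained to
# *remove* the information that predictor needs (e.g., speaker identity from content codes).
content_code = torch.randn(4, 256, requires_grad=True)
reversed_code = grad_reverse(content_code, lam=0.5)
```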
As a line of research on the efficient use of data, multi-source unsupervised domain adaptation transfers knowledge from multiple source domains with labeled data to an unlabeled target domain. However, both the distribution discrepancy between different domains and the noisy pseudo-labels in the target domain lead to performance bottlenecks for multi-source unsupervised domain adaptation methods. In light of this, we propose a method that integrates Attention-Driven Domain fusion and Noise-Tolerant learning (ADNT) to address the above two issues. First, we establish a contrary attention structure to perform information interaction between features and induce domain movement. With this approach, the discriminability of the features can also be significantly improved while the domain discrepancy is reduced. Second, based on the features obtained from unsupervised domain adaptation training, we design an Adaptive Reverse Cross Entropy loss, which can directly impose constraints on the generation of pseudo-labels. Finally, combining these two approaches, experimental results on several benchmarks further validate the effectiveness of our proposed ADNT and demonstrate performance superior to state-of-the-art methods.
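The reverse-cross-entropy family that the adaptive loss builds on can be sketched in its standard form; the paper's adaptive weighting is not reproduced here, and the clamping constant is an assumption.

```python
# Standard reverse cross-entropy (RCE) sketch, a generic noise-tolerant loss for noisy pseudo-labels.
import torch
import torch.nn.functional as F

def reverse_cross_entropy(logits, pseudo_labels, num_classes, clamp_log=-4.0):
    # logits: (batch, num_classes); pseudo_labels: (batch,) possibly noisy hard labels
    pred = F.softmax(logits, dim=1)
    one_hot = F.one_hot(pseudo_labels, num_classes).float()
    # log(0) is undefined for the zero entries of the one-hot target, so it is clamped to a constant
    log_target = torch.clamp(torch.log(one_hot + 1e-12), min=clamp_log)
    return (-pred * log_target).sum(dim=1).mean()

loss = reverse_cross_entropy(torch.randn(8, 31), torch.randint(0, 31, (8,)), num_classes=31)
```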