The quality of knowledge retrieval is crucial in knowledge-intensive conversations. Two common strategies to improve retrieval quality are finetuning the retriever and generating a self-contained query, but both impose heavy burdens in the form of expensive computation and elaborate annotations. In this paper, we propose an unsupervised query-enhanced approach for knowledge-intensive conversations, namely QKConv. QKConv consists of three modules: a query generator, an off-the-shelf knowledge selector, and a response generator. Without extra supervision, the end-to-end joint training of QKConv explores multiple candidate queries and uses the correspondingly selected knowledge to yield the target response. To evaluate the effectiveness of the proposed method, we conducted comprehensive experiments on conversational question answering, task-oriented dialogue, and knowledge-grounded conversation. Experimental results demonstrate that QKConv achieves state-of-the-art performance among unsupervised methods and competitive performance compared with supervised methods.
translated by Google Translate
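The abstract does not spell out the training objective. One common way to train a query generator end to end without query labels is to treat the query as a latent variable and maximize the marginal likelihood of the target response over a few sampled candidate queries, in the style of retrieval-augmented generation. The sketch below assumes exactly that: the function names are hypothetical, and the scores would come from the query generator (log p(q | c)) and from the response generator conditioned on the knowledge selected for each query (log p(r | c, k_q)). It is not claimed to be QKConv's exact loss.

```python
import math

def logsumexp(xs):
    """Numerically stable log(sum(exp(x) for x in xs))."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def log_softmax(xs):
    """Renormalize scores into log-probabilities over the given candidates."""
    z = logsumexp(xs)
    return [x - z for x in xs]

def marginal_response_loglik(query_scores, response_logprobs):
    """Hypothetical objective: log p(r | c) = log sum_q p(q | c) * p(r | c, k_q),
    with p(q | c) renormalized over the sampled candidate queries."""
    if len(query_scores) != len(response_logprobs):
        raise ValueError("need one response log-prob per candidate query")
    log_q = log_softmax(query_scores)
    return logsumexp([lq + lr for lq, lr in zip(log_q, response_logprobs)])
```

Training would minimize the negative of this value. Because the candidate-query weights sit inside the log-sum, the query generator receives gradients through the response likelihood alone, with no query annotations.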
The number of international benchmarking competitions is steadily increasing in various fields of machine learning (ML) research and practice. So far, however, little is known about common practices and the bottlenecks the community faces in tackling the research questions posed. To shed light on the status quo of algorithm development in the specific field of biomedical image analysis, we designed an international survey that was issued to all participants of challenges conducted in conjunction with the IEEE ISBI 2021 and MICCAI 2021 conferences (80 competitions in total). The survey covered participants' expertise and working environments, their chosen strategies, and algorithm characteristics. A median of 72% of challenge participants took part in the survey. According to our results, knowledge exchange was the primary incentive (70%) for participation, while receiving prize money played only a minor role (16%). While a median of 80 working hours was spent on method development, a large portion of participants (32%) stated that they did not have enough time for method development. 25% perceived the infrastructure to be a bottleneck. Overall, 94% of all solutions were deep learning-based. Of these, 84% were based on standard architectures. 43% of the respondents reported that the data samples (e.g., images) were too large to be processed at once. This was most commonly addressed by patch-based training (69%), downsampling (37%), and solving 3D analysis tasks as a series of 2D tasks. K-fold cross-validation on the training set was performed by only 37% of the participants, and only 50% of the participants performed ensembling, based either on multiple identical models (61%) or on heterogeneous models (39%). 48% of the respondents applied postprocessing steps.
Large-scale diffusion models have achieved state-of-the-art results on text-to-image synthesis (T2I) tasks. Despite their ability to generate high-quality and creative images, we observe that attribute binding and compositional capabilities remain major challenges, especially when multiple objects are involved. In this work, we improve the compositional skills of T2I models, specifically more accurate attribute binding and better image compositions. To do this, we incorporate linguistic structures into the diffusion guidance process, based on the controllable properties of manipulating cross-attention layers in diffusion-based T2I models. We observe that keys and values in cross-attention layers have strong semantic meanings associated with object layouts and content. Therefore, by manipulating the cross-attention representations based on linguistic insights, we can better preserve the compositional semantics in the generated image. Built upon Stable Diffusion, a SOTA T2I model, our structured cross-attention design is efficient and requires no additional training samples. We achieve better compositional skills in both qualitative and quantitative results, leading to a 5-8% advantage in head-to-head user comparison studies. Lastly, we conduct an in-depth analysis to reveal potential causes of incorrect image compositions and justify the properties of cross-attention layers in the generation process.
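To make the key/value observation concrete, here is a minimal NumPy sketch of a cross-attention step in which the attention map (computed from the full-prompt keys, and hence tied to layout) is kept fixed while several value sets are averaged. Treating each value set as a separate noun-phrase encoding of the prompt is an assumption about how linguistic structure could be injected; the function names are hypothetical, and multi-head projections and the diffusion loop are omitted.

```python
import numpy as np

def attention_map(q, k):
    """softmax(Q K^T / sqrt(d)); q: (n_pixels, d), k: (n_tokens, d)."""
    logits = q @ k.T / np.sqrt(q.shape[-1])
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    return weights / weights.sum(axis=-1, keepdims=True)

def cross_attention(q, k, v):
    """Standard cross-attention: each pixel mixes the token value vectors."""
    return attention_map(q, k) @ v

def structured_cross_attention(q, k_full, value_sets):
    """Hypothetical structured variant: reuse the attention map from the
    full-prompt keys, then average the outputs obtained with several value
    sets (e.g. one per noun-phrase encoding, each shaped like k_full)."""
    attn = attention_map(q, k_full)
    return np.mean([attn @ v for v in value_sets], axis=0)
```

Because the attention map is shared, averaging the outputs is the same as attending over averaged values, so the spatial layout stays fixed while the content contributed by each phrase is mixed in.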
Deformable image registration, i.e., the task of aligning multiple images into one coordinate system by a non-linear transformation, serves as an essential preprocessing step for neuroimaging data. Recent research on deformable image registration mainly focuses on improving registration accuracy using multi-stage alignment methods, where the source image is repeatedly deformed in stages by the same neural network until it is well aligned with the target image. Conventional methods for multi-stage registration often blur the source image, because the pixel/voxel values are repeatedly interpolated from the image generated by the previous stage. However, maintaining image quality such as sharpness during image registration is crucial to medical data analysis. In this paper, we study the problem of anti-blur deformable image registration and propose a novel solution, called Anti-Blur Network (ABN), for multi-stage image registration. Specifically, we use a pair of short-term registration and long-term memory networks to learn the nonlinear deformations at each stage: the short-term registration network learns how to improve the registration accuracy incrementally, and the long-term memory network combines all the previous deformations so that interpolation can be performed directly on the raw image, preserving image sharpness. Extensive experiments on both natural and medical image datasets demonstrate that ABN can accurately register images while preserving their sharpness. Our code and data can be found at https://github.com/anonymous3214/ABN
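The benefit of interpolating the raw image directly can be seen in a 1D toy example. The sketch below warps a sharp step edge twice by half a pixel, once by re-interpolating the intermediate result and once by first composing the two displacement fields and then interpolating the original signal a single time. The displacements are hand-chosen stand-ins for per-stage deformations; nothing here models ABN's short-term or long-term networks, and the helper names are hypothetical.

```python
import numpy as np

def warp(signal, disp):
    """Backward warp with linear interpolation: out[i] = signal(i + disp[i]).
    np.interp clamps sample positions outside the grid to the edge values."""
    grid = np.arange(len(signal), dtype=float)
    return np.interp(grid + disp, grid, signal)

def compose(disp_a, disp_b):
    """Single displacement equivalent to warping by disp_a first, then disp_b:
    out[i] = signal(i + b[i] + a(i + b[i])). Only the displacement field is
    interpolated, never the image itself."""
    grid = np.arange(len(disp_a), dtype=float)
    return disp_b + np.interp(grid + disp_b, grid, disp_a)

signal = np.array([0, 0, 0, 0, 1, 1, 1, 1], dtype=float)  # sharp step edge
half = np.full(len(signal), 0.5)                          # half-pixel shift

sequential = warp(warp(signal, half), half)   # interpolates the image twice
composed = warp(signal, compose(half, half))  # interpolates the raw image once
```

With these inputs, sequential warping smears the edge over two samples ([0, 0, 0.25, 0.75, 1, 1, 1, 1]), whereas the composed warp keeps a single-sample step ([0, 0, 0, 1, 1, 1, 1, 1]), which is the sharpness argument made above.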
Generating new fonts is a time-consuming and labor-intensive task, especially for a language with a huge number of characters, such as Chinese. Various deep learning models have demonstrated the ability to efficiently generate new fonts from a few reference characters of the target style. This project aims to develop a few-shot cross-lingual font generator based on AGIS-Net and to improve its performance metrics. Our approaches include redesigning the encoder and the loss function. We will validate our method on multiple languages and datasets.
Photo-realistic style transfer aims to transfer the artistic style of an exemplar style image to a content image, producing a result without spatial distortions or unrealistic artifacts. Recent deep models have achieved impressive results. However, methods based on deep neural networks are too expensive to run in real time. Meanwhile, bilateral-grid-based methods are much faster but still produce artifacts such as overexposure. In this work, we propose the Adaptive ColorMLP (AdaCM), an effective and efficient framework for universal photo-realistic style transfer. First, we find that the complex non-linear color mapping between the input and target domains can be efficiently modeled by a small multi-layer perceptron (ColorMLP). Then, in AdaCM, we adopt a CNN encoder to adaptively predict all parameters of the ColorMLP, conditioned on each input pair of content and style images. Experimental results demonstrate that AdaCM can generate vivid, high-quality stylization results. Moreover, AdaCM is ultrafast and can process a 4K-resolution image in 6 ms on one V100 GPU.
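The efficiency claim rests on the ColorMLP acting on each pixel's color independently. A minimal sketch, assuming a 3 → hidden → 3 MLP with a ReLU and biases whose weights arrive as one flat vector (the quantity AdaCM's CNN encoder would predict); the layer sizes, activation, parameter layout, and function names are assumptions, and the encoder itself is not shown.

```python
import numpy as np

def n_colormlp_params(hidden):
    """Parameter count of a hypothetical 3 -> hidden -> 3 MLP with biases."""
    return 3 * hidden + hidden + hidden * 3 + 3

def colormlp(image, params, hidden=16):
    """Map every RGB pixel independently through a tiny MLP whose weights are
    unpacked from one flat vector (e.g. an encoder's predicted parameters)."""
    if params.size != n_colormlp_params(hidden):
        raise ValueError("params has the wrong length for this hidden size")
    sizes = [3 * hidden, hidden, hidden * 3, 3]
    w1, b1, w2, b2 = np.split(params, np.cumsum(sizes)[:-1])
    w1, w2 = w1.reshape(3, hidden), w2.reshape(hidden, 3)
    pixels = image.reshape(-1, 3)                   # (H*W, 3) colors
    hidden_act = np.maximum(pixels @ w1 + b1, 0.0)  # ReLU
    return (hidden_act @ w2 + b2).reshape(image.shape)
```

Since the per-pixel cost is two tiny matrix products with no spatial context, the runtime grows linearly with resolution and parallelizes trivially across pixels.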
Reference-based image super-resolution (RefSR) is a promising branch of SR and has shown great potential in overcoming the limitations of single-image super-resolution. While previous state-of-the-art RefSR methods mainly focus on improving the efficacy and robustness of reference feature transfer, it is generally overlooked that a well-reconstructed SR image, when used as a reference, should enable better SR reconstruction of similar LR images. Therefore, in this work, we propose a reciprocal learning framework that leverages this fact to reinforce the learning of a RefSR network. In addition, we design a progressive feature alignment and selection module to further improve the RefSR task. The proposed module aligns reference and input images in multi-scale feature spaces and performs reference-aware feature selection progressively, so that more precise reference features can be transferred into the input features and the network capability is enhanced. Our reciprocal learning paradigm is model-agnostic and can be applied to arbitrary RefSR models. We empirically show that multiple recent state-of-the-art RefSR models can be consistently improved with our reciprocal learning paradigm. Furthermore, our proposed model, together with the reciprocal learning strategy, sets new state-of-the-art results on multiple benchmarks.
Prompt tuning is a new few-shot transfer learning technique that tunes only the learnable prompt of pre-trained vision and language models such as CLIP. However, existing prompt tuning methods tend to learn spurious or entangled representations, which leads to poor generalization to unseen concepts. Towards non-spurious and efficient prompt learning from limited examples, this paper presents a novel Counterfactual Prompt Learning (CPL) method for vision and language models, which simultaneously employs counterfactual generation and contrastive learning in a joint optimization framework. Specifically, CPL constructs counterfactuals by identifying the minimal non-spurious feature change between semantically similar positive and negative samples that causes a concept change, and learns more generalizable prompt representations from both factual and counterfactual examples via contrastive learning. Extensive experiments demonstrate that CPL obtains better few-shot performance than previous prompt tuning methods on CLIP across different vision and language tasks. On image classification, we achieve a 3.55% average relative improvement on unseen classes across seven datasets; on image-text retrieval and visual question answering, we gain up to 4.09% and 25.08% relative improvements, respectively, across three few-shot scenarios on unseen test sets.
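Two ingredients named above, counterfactual construction and contrastive learning, can be sketched in a few lines. The sketch assumes that a counterfactual feature is formed by swapping a masked subset of a positive sample's feature dimensions for those of a similar negative sample, and it uses a generic InfoNCE loss as a stand-in for CPL's objective. In CPL the change would be optimized to be minimal, whereas here the mask is given by hand; the function names and temperature are illustrative assumptions.

```python
import numpy as np

def counterfactual(pos_feat, neg_feat, mask):
    """Counterfactual feature: replace the dimensions selected by `mask`
    (values in [0, 1]) with those of a semantically similar negative sample."""
    return (1.0 - mask) * pos_feat + mask * neg_feat

def info_nce(anchor, positive, negatives, temperature=0.07):
    """Generic InfoNCE loss with cosine similarity: -log of the softmax
    probability assigned to the positive among positive + negatives."""
    def normalize(x):
        return x / np.linalg.norm(x, axis=-1, keepdims=True)
    a, p, n = normalize(anchor), normalize(positive), normalize(negatives)
    logits = np.concatenate([[a @ p], n @ a]) / temperature
    logits = logits - logits.max()  # numerical stability
    return -(logits[0] - np.log(np.exp(logits).sum()))
```

In a prompt-learning setup, the anchor could be a prompt's text feature, the positive a factual image feature, and counterfactual image features could be included among the negatives.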
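Non-line-of-sight (NLOS) imaging is an emerging technique for detecting objects hidden behind obstacles or around corners. Recent studies on passive NLOS mainly focus on steady-state measurement and reconstruction methods, which show limitations in recognizing moving targets. To the best of our knowledge, we propose a novel event-based passive NLOS imaging method. We acquire asynchronous event-based data that contain detailed dynamic information about the NLOS target and effectively mitigate the speckle degradation caused by motion. In addition, we create the first event-based NLOS imaging dataset, NLOS-ES, and extract event-based features using a time-surface representation. We compare reconstructions from event-based data with those from frame-based data. The event-based method performs well on PSNR and LPIPS, outperforming the frame-based method by 20% and 10%, respectively, while using only 2% of the data volume of the traditional method.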
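The time-surface representation mentioned above maps an asynchronous event stream to a dense image: each pixel stores an exponentially decaying function of the time since its most recent event. A minimal sketch, assuming events arrive as (t, x, y, polarity) tuples, one channel per polarity, and a hypothetical decay constant tau; the paper's exact feature extractor may differ.

```python
import numpy as np

def time_surface(events, height, width, t_ref, tau=0.05):
    """Exponentially decaying time surface evaluated at time t_ref.

    events: iterable of (t, x, y, polarity) tuples. Returns a (2, H, W) array,
    one channel per polarity, holding exp(-(t_ref - t_last) / tau) where t_last
    is the pixel's latest event time up to t_ref, and 0 where none occurred.
    """
    t_last = np.full((2, height, width), -np.inf)
    for t, x, y, polarity in events:
        if t > t_ref:
            continue  # ignore events after the reference time
        channel = 1 if polarity > 0 else 0
        t_last[channel, y, x] = max(t_last[channel, y, x], t)
    # Pixels with no event keep t_last = -inf, so exp(-inf) gives 0.
    return np.exp(-(t_ref - t_last) / tau)
```

Recent events map to values near 1 and stale ones decay toward 0, so a moving target leaves a graded trail that encodes its motion in a single frame-like array.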
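Three-dimensional (3D) images, such as CT, MRI, and PET, are common in medical imaging applications and important in clinical diagnosis. Semantic ambiguity is a typical characteristic of many medical image labels. It can be caused by many factors, such as imaging properties, pathological anatomy, and the weak representation of binary masks, which makes accurate 3D segmentation challenging. In 2D medical images, characterizing lesions with soft masks generated by image matting instead of binary masks can provide rich semantic information and describe the structural characteristics of lesions more comprehensively, thereby benefiting subsequent diagnosis and analysis. In this work, we introduce image matting into 3D scenes to describe lesions in 3D medical images. Research on image matting in the 3D modality is limited, and there is no high-quality annotated dataset for 3D matting, which slows the development of data-driven deep learning methods. To address this issue, we construct the first 3D medical matting dataset and convincingly verify its validity through quality control and downstream experiments on lung nodule classification. We then adapt four selected state-of-the-art 2D image matting algorithms to 3D scenes and further customize them for CT images. In addition, we propose the first end-to-end deep 3D matting network and implement a solid 3D medical image matting benchmark, which will be released to encourage further research.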
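For reference, image matting rests on the standard compositing model, in which each observed intensity is a blend of a foreground and a background value weighted by a soft opacity α. Writing it voxel-wise over a 3D grid Ω is an assumption about how the 3D setting extends the 2D formulation. A binary mask corresponds to restricting α to {0, 1}, which is the weak representation the abstract argues against.

```latex
I(\mathbf{x}) = \alpha(\mathbf{x})\, F(\mathbf{x}) + \bigl(1 - \alpha(\mathbf{x})\bigr)\, B(\mathbf{x}),
\qquad \mathbf{x} \in \Omega \subset \mathbb{Z}^{3}, \quad \alpha(\mathbf{x}) \in [0, 1]
```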