Intent classification and slot filling are two core tasks in natural language understanding (NLU). The interaction nature of the two tasks makes the joint models often outperform the single designs. One of the promising solutions, called BERT (Bidirectional Encoder Representations from Transformers), achieves the joint optimization of the two tasks. BERT adopts the wordpiece to tokenize each input token into multiple sub-tokens, which causes a mismatch between the tokens and the labels lengths. Previous methods utilize the hidden states corresponding to the first sub-token as input to the classifier, which limits performance improvement since some hidden semantic informations is discarded in the fine-tune process. To address this issue, we propose a novel joint model based on BERT, which explicitly models the multiple sub-tokens features after wordpiece tokenization, thereby generating the context features that contribute to slot filling. Specifically, we encode the hidden states corresponding to multiple sub-tokens into a context vector via the attention mechanism. Then, we feed each context vector into the slot filling encoder, which preserves the integrity of the sentence. Experimental results demonstrate that our proposed model achieves significant improvement on intent classification accuracy, slot filling F1, and sentence-level semantic frame accuracy on two public benchmark datasets. The F1 score of the slot filling in particular has been improved from 96.1 to 98.2 (2.1% absolute) on the ATIS dataset.
translated by 谷歌翻译
Text-guided diffusion models have shown superior performance in image/video generation and editing. While few explorations have been performed in 3D scenarios. In this paper, we discuss three fundamental and interesting problems on this topic. First, we equip text-guided diffusion models to achieve $\textbf{3D-consistent generation}$. Specifically, we integrate a NeRF-like neural field to generate low-resolution coarse results for a given camera view. Such results can provide 3D priors as condition information for the following diffusion process. During denoising diffusion, we further enhance the 3D consistency by modeling cross-view correspondences with a novel two-stream (corresponding to two different views) asynchronous diffusion process. Second, we study $\textbf{3D local editing}$ and propose a two-step solution that can generate 360$^{\circ}$ manipulated results by editing an object from a single view. Step 1, we propose to perform 2D local editing by blending the predicted noises. Step 2, we conduct a noise-to-text inversion process that maps 2D blended noises into the view-independent text embedding space. Once the corresponding text embedding is obtained, 360$^{\circ}$ images can be generated. Last but not least, we extend our model to perform \textbf{one-shot novel view synthesis} by fine-tuning on a single image, firstly showing the potential of leveraging text guidance for novel view synthesis. Extensive experiments and various applications show the prowess of our 3DDesigner. The project page is available at https://3ddesigner-diffusion.github.io/.
translated by 谷歌翻译
We present BotSIM, a data-efficient end-to-end Bot SIMulation toolkit for commercial text-based task-oriented dialog (TOD) systems. BotSIM consists of three major components: 1) a Generator that can infer semantic-level dialog acts and entities from bot definitions and generate user queries via model-based paraphrasing; 2) an agenda-based dialog user Simulator (ABUS) to simulate conversations with the dialog agents; 3) a Remediator to analyze the simulated conversations, visualize the bot health reports and provide actionable remediation suggestions for bot troubleshooting and improvement. We demonstrate BotSIM's effectiveness in end-to-end evaluation, remediation and multi-intent dialog generation via case studies on two commercial bot platforms. BotSIM's "generation-simulation-remediation" paradigm accelerates the end-to-end bot evaluation and iteration process by: 1) reducing manual test cases creation efforts; 2) enabling a holistic gauge of the bot in terms of NLU and end-to-end performance via extensive dialog simulation; 3) improving the bot troubleshooting process with actionable suggestions. A demo of our system can be found at https://tinyurl.com/mryu74cd and a demo video at https://youtu.be/qLi5iSoly30. We have open-sourced the toolkit at https://github.com/salesforce/botsim
translated by 谷歌翻译
Various depth estimation models are now widely used on many mobile and IoT devices for image segmentation, bokeh effect rendering, object tracking and many other mobile tasks. Thus, it is very crucial to have efficient and accurate depth estimation models that can run fast on low-power mobile chipsets. In this Mobile AI challenge, the target was to develop deep learning-based single image depth estimation solutions that can show a real-time performance on IoT platforms and smartphones. For this, the participants used a large-scale RGB-to-depth dataset that was collected with the ZED stereo camera capable to generated depth maps for objects located at up to 50 meters. The runtime of all models was evaluated on the Raspberry Pi 4 platform, where the developed solutions were able to generate VGA resolution depth maps at up to 27 FPS while achieving high fidelity results. All models developed in the challenge are also compatible with any Android or Linux-based mobile devices, their detailed description is provided in this paper.
translated by 谷歌翻译
We present a strong object detector with encoder-decoder pretraining and finetuning. Our method, called Group DETR v2, is built upon a vision transformer encoder ViT-Huge~\cite{dosovitskiy2020image}, a DETR variant DINO~\cite{zhang2022dino}, and an efficient DETR training method Group DETR~\cite{chen2022group}. The training process consists of self-supervised pretraining and finetuning a ViT-Huge encoder on ImageNet-1K, pretraining the detector on Object365, and finally finetuning it on COCO. Group DETR v2 achieves $\textbf{64.5}$ mAP on COCO test-dev, and establishes a new SoTA on the COCO leaderboard https://paperswithcode.com/sota/object-detection-on-coco
translated by 谷歌翻译
最近,基于云的图形卷积网络(GCN)在许多对隐私敏感的应用程序(例如个人医疗保健和金融系统)中表现出了巨大的成功和潜力。尽管在云上具有很高的推理准确性和性能,但在GCN推理中保持数据隐私,这对于这些实际应用至关重要,但仍未得到探索。在本文中,我们对此进行了初步尝试,并开发了$ \ textit {cryptogcn} $ - 基于GCN推理框架的同型加密(HE)。我们方法成功的关键是减少HE操作的巨大计算开销,这可能比明文空间中的同行高的数量级。为此,我们开发了一种方法,可以有效利用GCN推断中基质操作的稀疏性,从而大大减少计算开销。具体而言,我们提出了一种新型的AMA数据格式方法和相关的空间卷积方法,该方法可以利用复杂的图结构并在HE计算中执行有效的矩阵矩阵乘法,从而大大减少HE操作。我们还开发了一个合作式框架,该框架可以通过明智的修剪和GCN中激活模块的多项式近似来探索准确性,安全级别和计算开销之间的交易折扣。基于NTU-Xview骨架关节数据集,即,据我们所知,最大的数据集对同型的评估,我们的实验结果表明,$ \ textit {cryptogcn} $均优胜于最先进的解决方案。同构操作的延迟和数量,即在延迟上达到3.10 $ \ times $加速,并将总代态操作数量减少77.4 \%,而准确度的较小精度损失为1-1.5 $ \%$。
translated by 谷歌翻译
青光眼是一种严重的盲目疾病,迫切需要自动检测方法来减轻眼科医生的稀缺性。许多作品提出采用深度学习方法,涉及视盘和杯中的分割以进行青光眼检测,其中分割过程通常仅被视为上游子任务。在青光眼评估中,底底图像与分割面具之间的关系很少探索。我们提出了一种基于细分的信息提取和融合方法来实现青光眼检测任务,该方法利用了分割掩模的稳健性,而无需忽略原始底底图像中的丰富信息。私有数据集和公共数据集的实验结果表明,我们提出的方法的表现优于所有仅利用底面图像或口罩的模型。
translated by 谷歌翻译
本文提出了一种新方法,该方法融合了混响场中的声学测量和低临界性惯性测量单元(IMU)运动报告,以同时定位和映射(SLAM)。与仅使用声学数据进行到达方向(DOA)估计的现有研究不同,源与传感器的距离是通过直接到依次的能量比(DRR)计算的,并用作新约束以消除非线性噪声从运动报告。应用粒子过滤器估计临界距离,这是将源距离与DRR关联的关键。使用密钥帧方法来消除源位置估计向机器人的偏差。拟议的DOA-DRR声学大满贯(D-D大满贯)设计用于三维运动,适合大多数机器人。该方法是第一个在现实世界中仅包含声学数据和IMU测量值的现实世界室内场景数据集上验证的声学大满贯算法。与以前的方法相比,D-D SLAM在定位机器人和从现实世界室内数据集中构建源地图方面具有可接受的性能。平均位置精度为0.48 m,而源位置误差在2.8 s内收敛到小于0.25 m。这些结果证明了D-D SLAM在现实世界室内场景中的有效性,这可能在环境有雾(即不适合光或激光辐照的环境)之后特别有用。
translated by 谷歌翻译
部分标签学习(PLL)是一项奇特的弱监督学习任务,其中训练样本通常与一组候选标签而不是单个地面真理相关联。尽管在该域中提出了各种标签歧义方法,但他们通常假设在许多现实世界应用中可能不存在类平衡的方案。从经验上讲,我们在面对长尾分布和部分标记的组合挑战时观察到了先前方法的退化性能。在这项工作中,我们首先确定先前工作失败的主要原因。随后,我们提出了一种新型的基于最佳运输的框架太阳能,它允许完善被歧义的标签,以匹配边缘级别的先验分布。太阳能还结合了一种新的系统机制,用于估计PLL设置下的长尾类先验分布。通过广泛的实验,与先前的最先进的PLL方法相比,太阳能在标准化基准方面表现出基本优势。代码和数据可在以下网址获得:https://github.com/hbzju/solar。
translated by 谷歌翻译
在卷积神经网络(CNN)的动力下,医学图像分类迅速发展。由于卷积内核的接受场的固定尺寸,很难捕获医学图像的全局特征。尽管基于自发的变压器可以对远程依赖性进行建模,但它具有很高的计算复杂性,并且缺乏局部电感偏见。许多研究表明,全球和本地特征对于图像分类至关重要。但是,医学图像具有许多嘈杂,分散的特征,类内的变化和类间的相似性。本文提出了三个分支分层的多尺度特征融合网络结构,称为医学图像分类为新方法。它可以融合多尺度层次结构的变压器和CNN的优势,而不会破坏各自的建模,从而提高各种医学图像的分类精度。局部和全局特征块的平行层次结构旨在有效地提取各种语义尺度的本地特征和全局表示,并灵活地在不同的尺度上建模,并与图像大小相关的线性计算复杂性。此外,自适应分层特征融合块(HFF块)旨在全面利用在不同层次级别获得的功能。 HFF块包含空间注意力,通道注意力,残留的倒置MLP和快捷方式,以在每个分支的各个规模特征之间适应融合语义信息。我们在ISIC2018数据集上提出的模型的准确性比基线高7.6%,COVID-19数据集的准确性为21.5%,Kvasir数据集的准确性为10.4%。与其他高级模型相比,HIFUSE模型表现最好。我们的代码是开源的,可从https://github.com/huoxiangzuo/hifuse获得。
translated by 谷歌翻译