Current image generation models struggle to reliably produce well-formed visual text. In this paper, we investigate a key contributing factor: popular text-to-image models lack character-level input features, making it much harder to predict a word's visual makeup as a series of glyphs. To quantify the extent of this effect, we conduct a series of controlled experiments comparing character-aware vs. character-blind text encoders. In the text-only domain, we find that character-aware models provide large gains on a novel spelling task (WikiSpell). Transferring these learnings onto the visual domain, we train a suite of image generation models, and show that character-aware variants outperform their character-blind counterparts across a range of novel text rendering tasks (our DrawText benchmark). Our models set a much higher state-of-the-art on visual spelling, with 30+ point accuracy gains over competitors on rare words, despite training on far fewer examples.
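As a rough illustration of the text-only spelling probe described above, here is a minimal sketch that builds (input, target) pairs for a word-to-spelling task. The prompt wording and the space-separated-character target are assumptions for illustration, not the exact WikiSpell format from the paper.

```python
# Minimal sketch of a WikiSpell-style spelling probe (assumed format: the model must
# map a word to its space-separated characters; the paper's exact setup may differ).
def make_spelling_example(word: str) -> tuple[str, str]:
    """Return an (input, target) pair for a text-to-text spelling task."""
    return f"Spell the word: {word}", " ".join(word)

# A character-blind subword tokenizer sees opaque pieces (e.g. something like
# ["gir", "affe"]), while a character-aware byte/character-level model sees every
# letter directly, which is the gap the controlled experiments measure.
examples = [make_spelling_example(w) for w in ["giraffe", "xylophone", "quokka"]]
for inp, tgt in examples:
    print(inp, "->", tgt)
```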
The separation between training and deployment of machine learning models implies that not all scenarios encountered in deployment can be anticipated during training, so relying solely on advances in training has its limits. Out-of-distribution (OOD) detection is an important area that stresses a model's ability to handle unseen situations: does a model know when it doesn't know? Existing OOD detection methods either incur extra training steps, require additional data, or make non-trivial modifications to the trained network. In contrast, in this work we propose an extremely simple, post-hoc, on-the-fly activation shaping method, ASH, in which a large portion (e.g. 90%) of a sample's activations at a late layer is removed and the rest (e.g. 10%) simplified or lightly adjusted. The shaping is applied at inference time and does not require any statistics computed from training data. Experiments show that this simple treatment sharpens the distinction between in-distribution and out-of-distribution samples, enabling state-of-the-art OOD detection on ImageNet without noticeably degrading in-distribution accuracy. Alongside the paper we release two calls, for explanation and for validation, believing in the collective power to further validate and understand this finding. Calls, video, and code can be found at: https://andrijazz.github.io/ash
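A minimal sketch of the activation-shaping idea, assuming one simple shaping rule: prune the bottom 90% of a sample's late-layer activations and rescale the survivors so the original total activation is preserved. The paper studies several variants, so treat the rescaling step here as illustrative rather than the definitive ASH algorithm.

```python
import numpy as np

def ash_shape(activations: np.ndarray, prune_percentile: float = 90.0) -> np.ndarray:
    """Simplified sketch of post-hoc activation shaping (one of several possible variants).

    Zeroes out the lowest `prune_percentile` percent of a sample's late-layer activations
    and rescales the survivors to preserve the original total activation. Applied only at
    inference time; no statistics from training data are needed.
    """
    flat = activations.reshape(-1)
    threshold = np.percentile(flat, prune_percentile)
    kept = np.where(flat >= threshold, flat, 0.0)
    total_before, total_after = flat.sum(), kept.sum()
    if total_after > 0:
        kept = kept * (total_before / total_after)  # light adjustment of the survivors
    return kept.reshape(activations.shape)

# Example: shape one sample's penultimate feature vector, then feed it to the classifier
# head as usual; an OOD score (e.g. energy/softmax) is computed on the resulting logits.
features = np.random.rand(2048).astype(np.float32)
shaped = ash_shape(features, prune_percentile=90.0)
```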
Open-vocabulary models are a promising new paradigm for image classification. Unlike traditional classification models, open-vocabulary models classify among any arbitrary set of categories specified with natural language at inference time. This natural language, called a "prompt", typically consists of a set of hand-written templates (e.g., "a photo of a {}") that are completed with each category name. This work introduces a simple method for generating higher-accuracy prompts, without explicit knowledge of the image domain and with far fewer hand-constructed sentences. To achieve this, we combine open-vocabulary models with large language models (LLMs) to create Customized Prompts via Language models (CuPL, pronounced "couple"). In particular, we leverage the knowledge contained in LLMs to generate many descriptive sentences tailored to each object category. We find that this straightforward and general approach improves accuracy across a range of zero-shot image classification benchmarks, including a gain of more than one percentage point on ImageNet. Finally, the method requires no additional training and remains entirely zero-shot. Code is available at https://github.com/sarahpratt/cupl.
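A hedged sketch of the CuPL recipe: ask an LLM for descriptive sentences about each category, embed them with the open-vocabulary model's text encoder, and average the result into a class embedding. The callables `generate_descriptions`, `encode_text`, and `encode_image` are placeholders standing in for the LLM query and a CLIP-style encoder; none of these names come from the released code.

```python
import numpy as np

def class_embedding(class_name, generate_descriptions, encode_text, n_prompts=10):
    """Average the embeddings of LLM-generated, category-specific descriptive prompts."""
    prompts = generate_descriptions(f"Describe what a {class_name} looks like.", n_prompts)
    embs = np.stack([encode_text(p) for p in prompts])
    mean = embs.mean(axis=0)
    return mean / np.linalg.norm(mean)

def zero_shot_classify(image, class_names, generate_descriptions, encode_text, encode_image):
    """Pick the class whose prompt-averaged text embedding best matches the image."""
    image_emb = encode_image(image)
    class_embs = np.stack(
        [class_embedding(c, generate_descriptions, encode_text) for c in class_names]
    )
    return class_names[int(np.argmax(class_embs @ image_emb))]
```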
Language models demonstrate both quantitative improvements and new qualitative capabilities as they scale. Despite their potentially transformative impact, these new capabilities are still poorly characterized. To inform future research, prepare for disruptive new model capabilities, and mitigate socially harmful effects, it is vital that we understand the present and near-future capabilities and limitations of language models. To address this challenge, we introduce the Beyond the Imitation Game benchmark (BIG-bench). BIG-bench currently consists of 204 tasks, contributed by 442 authors across 132 institutions. Task topics are diverse, drawing from linguistics, childhood development, mathematics, commonsense reasoning, biology, physics, social bias, software development, and beyond. BIG-bench focuses on tasks believed to be beyond the capabilities of current language models. We evaluate the behavior of OpenAI's GPT models, Google-internal dense transformer architectures, and Switch-style sparse transformers, spanning millions to billions of parameters. In addition, a team of human expert raters performed all tasks to provide a strong baseline. Findings include: model performance and calibration both improve with scale, but are poor in absolute terms (and compared with rater performance); performance is remarkably similar across model classes, though with benefits from sparsity; tasks that improve gradually and predictably commonly involve a large knowledge or memorization component, whereas tasks that exhibit "breakthrough" behavior at a critical scale often involve multiple steps or components, or brittle metrics; and social bias typically increases with scale in settings with ambiguous context, though this can be improved with prompting.
Although state-of-the-art object detection methods show compelling performance, models are often not robust to adversarial attacks and out-of-distribution data. We introduce a new dataset, Natural Adversarial Objects (NAO), to evaluate the robustness of object detection models. NAO contains 7,934 images and 9,943 objects that are unmodified and representative of real-world scenarios, yet cause state-of-the-art detection models to misclassify with high confidence. The mean average precision (mAP) of EfficientDet drops by 74.5% when evaluated on NAO compared to the standard MSCOCO validation set. Moreover, by comparing a variety of object detection architectures, we find that better performance on the MSCOCO validation set does not necessarily translate to better performance on NAO, suggesting that robustness cannot be achieved simply by training a more accurate model. We further investigate why examples in NAO are difficult to detect and classify. Experiments with shuffled image patches reveal that models are overly sensitive to local texture. In addition, using integrated gradients and background replacement, we find that detection models rely on pixel information within the bounding box and are insensitive to background context when predicting class labels. NAO can be downloaded at https://drive.google.com/drive/folders/15p8sowojku6sseihlets86orfytgezi8.
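A small sketch of the patch-shuffling probe mentioned above, assuming non-overlapping square patches on an (H, W, C) NumPy image; the paper's exact protocol may differ. If a detector's confidence barely changes on the shuffled image, its prediction is driven mostly by local texture rather than global object structure.

```python
import numpy as np

def shuffle_patches(image: np.ndarray, patch_size: int, rng=None) -> np.ndarray:
    """Randomly permute non-overlapping patches of an (H, W, C) image."""
    rng = np.random.default_rng() if rng is None else rng
    h, w, c = image.shape
    gh, gw = h // patch_size, w // patch_size
    # Cut the image into a grid of patches, permute them, and stitch the grid back together.
    patches = (
        image[: gh * patch_size, : gw * patch_size]
        .reshape(gh, patch_size, gw, patch_size, c)
        .transpose(0, 2, 1, 3, 4)
        .reshape(gh * gw, patch_size, patch_size, c)
    )
    patches = patches[rng.permutation(len(patches))]
    return (
        patches.reshape(gh, gw, patch_size, patch_size, c)
        .transpose(0, 2, 1, 3, 4)
        .reshape(gh * patch_size, gw * patch_size, c)
    )
```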
We develop e-variables for testing whether two or more data streams come from the same source or, more generally, whether the difference between the sources is larger than some minimal effect size. These e-variables lead to exact, non-asymptotic tests that remain safe, i.e., retain their type-I error guarantees, under flexible sampling scenarios such as optional stopping and continuation. In special cases, our e-variables also have an optimal "growth" property under the alternative. While the construction is generic, we illustrate it through the special case of K x 2 contingency tables, where we also allow different restrictions to be imposed on a composite alternative. Comparisons with p-value analyses, in simulations and a real-world example, show that through their flexibility e-variables often allow data collection to be stopped early while retaining power similar to classical methods, and also preserve the option of extending or combining the data afterwards.
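For readers unfamiliar with e-variables, the following math block states the standard definition and safety property that constructions like this build on; it is background, not the paper's specific K x 2 contingency-table derivation.

```latex
An e-variable $E$ for a null hypothesis $H_0$ is a nonnegative statistic satisfying
\[
  \mathbb{E}_P[E] \le 1 \qquad \text{for every } P \in H_0 .
\]
Rejecting $H_0$ when $E \ge 1/\alpha$ keeps the type-I error below $\alpha$ by
Markov's inequality:
\[
  P\bigl(E \ge 1/\alpha\bigr) \le \alpha \,\mathbb{E}_P[E] \le \alpha .
\]
When each new batch of data yields an e-variable $E_i$ that is valid conditionally on
the past, the running product $E^{(n)} = \prod_{i=1}^{n} E_i$ is a test supermartingale,
and Ville's inequality extends the $\alpha$-level guarantee to data-dependent stopping
times, which is what makes optional stopping and continuation safe.
```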
In this paper, we propose a robust 3D detector, named Cross Modal Transformer (CMT), for end-to-end 3D multi-modal detection. Without explicit view transformation, CMT takes image and point cloud tokens as inputs and directly outputs accurate 3D bounding boxes. The spatial alignment of multi-modal tokens is performed implicitly by encoding the 3D points into multi-modal features. The core design of CMT is quite simple, yet its performance is impressive: CMT obtains 73.0% NDS on the nuScenes benchmark. Moreover, CMT remains robust even when the LiDAR input is missing. Code will be released at https://github.com/junjie18/CMT.
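A schematic sketch of the architecture as described in the abstract, built from a generic PyTorch transformer decoder: a set of object queries cross-attends to concatenated image and point-cloud tokens and regresses 3D boxes directly. Layer sizes, head counts, and the box parameterization are assumptions for illustration, not the released CMT implementation.

```python
import torch
import torch.nn as nn

class CrossModalDetectorSketch(nn.Module):
    """Object queries attend to fused image + LiDAR tokens; no explicit view transform."""

    def __init__(self, d_model=256, num_queries=900, num_layers=6, num_classes=10, box_dim=10):
        super().__init__()
        self.queries = nn.Embedding(num_queries, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers)
        self.cls_head = nn.Linear(d_model, num_classes)
        self.box_head = nn.Linear(d_model, box_dim)  # e.g. center, size, yaw, velocity

    def forward(self, image_tokens, lidar_tokens=None):
        # Both token sets are assumed to already carry 3D-aware position encodings,
        # so image/LiDAR alignment happens implicitly inside cross-attention.
        tokens = [image_tokens] if lidar_tokens is None else [image_tokens, lidar_tokens]
        memory = torch.cat(tokens, dim=1)                          # (B, N_img [+ N_pts], d)
        q = self.queries.weight.unsqueeze(0).expand(memory.size(0), -1, -1)
        hs = self.decoder(q, memory)                               # (B, num_queries, d)
        return self.cls_head(hs), self.box_head(hs)
```

Passing only image tokens (no LiDAR) reuses the same decoder, which is one way to read the robustness claim when the LiDAR stream is missing.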
Dataset distillation has emerged as a prominent technique to improve data efficiency when training machine learning models. It encapsulates the knowledge from a large dataset into a smaller synthetic dataset. A model trained on this smaller distilled dataset can attain comparable performance to a model trained on the original training dataset. However, the existing dataset distillation techniques mainly aim at achieving the best trade-off between resource usage efficiency and model utility. The security risks stemming from them have not been explored. This study performs the first backdoor attack against the models trained on the data distilled by dataset distillation models in the image domain. Concretely, we inject triggers into the synthetic data during the distillation procedure rather than during the model training stage, where all previous attacks are performed. We propose two types of backdoor attacks, namely NAIVEATTACK and DOORPING. NAIVEATTACK simply adds triggers to the raw data at the initial distillation phase, while DOORPING iteratively updates the triggers during the entire distillation procedure. We conduct extensive evaluations on multiple datasets, architectures, and dataset distillation techniques. Empirical evaluation shows that NAIVEATTACK achieves decent attack success rate (ASR) scores in some cases, while DOORPING reaches higher ASR scores (close to 1.0) in all cases. Furthermore, we conduct a comprehensive ablation study to analyze the factors that may affect the attack performance. Finally, we evaluate multiple defense mechanisms against our backdoor attacks and show that our attacks can practically circumvent these defense mechanisms.
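A hedged sketch of what a NAIVEATTACK-style poisoning step could look like: stamp a trigger patch onto a fraction of the raw images and relabel them to the attacker's target class before distillation runs, so the backdoor is baked into the distilled synthetic set. The shapes, trigger location, and poison rate are illustrative assumptions, not the paper's exact settings.

```python
import numpy as np

def naive_attack_poison(images, labels, target_label, trigger, poison_rate=0.1, rng=None):
    """Poison a fraction of the raw dataset prior to dataset distillation.

    images:  (N, H, W, C) array;  labels: (N,) array;  trigger: (th, tw, C) patch.
    """
    rng = np.random.default_rng() if rng is None else rng
    images, labels = images.copy(), labels.copy()
    n_poison = int(len(images) * poison_rate)
    idx = rng.choice(len(images), size=n_poison, replace=False)
    th, tw = trigger.shape[:2]
    images[idx, -th:, -tw:] = trigger      # stamp the trigger in the bottom-right corner
    labels[idx] = target_label             # relabel to the attacker's target class
    return images, labels

# DOORPING instead re-optimizes the trigger repeatedly inside the distillation loop,
# which is why it reaches near-perfect attack success rates in the paper's evaluation.
```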
Few-Shot Instance Segmentation (FSIS) requires models to detect and segment novel classes from only a few support examples. In this work, we explore a simple yet unified solution for FSIS and its incremental variants, and introduce a new framework named Reference Twice (RefT) to fully exploit the relationship between support and query features within a Transformer-like framework. Our key insights are twofold: first, with the aid of support masks, we can generate dynamic class centers that re-weight query features more appropriately; second, we find that support object queries have already encoded key factors after base training. In this way, the query features can be enhanced twice, at the feature level and at the instance level. In particular, we first design a mask-based dynamic weighting module to enhance support features, and then propose to link object queries for better calibration via cross-attention. After these steps, performance on novel classes improves significantly over our strong baseline. Additionally, the framework can be easily extended to incremental FSIS with minor modifications. Benchmarking on the COCO dataset under the FSIS, gFSIS, and iFSIS settings, our method achieves competitive performance compared to existing approaches across different shot counts; for example, we boost nAP by a noticeable +8.2/+9.4 over the current state-of-the-art FSIS method in the 10/30-shot settings. We further demonstrate the superiority of our approach on Few-Shot Object Detection. Code and models will be available.
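A rough sketch of the feature-level reference step described above: support masks pool support features into per-class centers, which then gate the query feature map channel-wise. The shapes and the simple sigmoid gating rule are assumptions for illustration; the actual RefT modules are more involved.

```python
import torch
import torch.nn.functional as F

def mask_pooled_class_centers(support_feats, support_masks):
    """support_feats: (K, C, H, W); support_masks: (K, 1, H, W) in [0, 1] -> (K, C) centers."""
    masks = F.interpolate(
        support_masks, size=support_feats.shape[-2:], mode="bilinear", align_corners=False
    )
    centers = (support_feats * masks).sum(dim=(2, 3)) / masks.sum(dim=(2, 3)).clamp(min=1e-6)
    return centers

def reweight_query(query_feats, centers):
    """query_feats: (B, C, H, W); use the averaged class center as a channel-wise gate."""
    gate = torch.sigmoid(centers.mean(dim=0))          # (C,)
    return query_feats * gate.view(1, -1, 1, 1)
```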
This paper focuses on designing efficient models with low parameter counts and FLOPs for dense prediction. Although CNN-based lightweight methods have achieved impressive results after years of research, the trade-off between model accuracy and constrained resources still leaves room for improvement. This work rethinks the essential unity of the efficient Inverted Residual Block in MobileNetV2 and the effective Transformer in ViT, inductively abstracting a general Meta-Mobile Block, and we argue that the specific instantiation is very important to model performance even though the overall framework is shared. Motivated by this observation, we deduce a simple yet efficient modern Inverted Residual Mobile Block (iRMB) for mobile applications, which absorbs CNN-like efficiency to model short-distance dependencies and Transformer-like dynamic modeling capability to learn long-distance interactions. Furthermore, we design a ResNet-like 4-stage Efficient MOdel (EMO) based solely on a series of iRMBs for dense applications. Extensive experiments on the ImageNet-1K, COCO2017, and ADE20K benchmarks demonstrate the superiority of EMO over state-of-the-art methods; e.g., EMO-1M/2M/5M achieve 71.5, 75.1, and 78.4 Top-1 accuracy, surpassing SoTA CNN- and Transformer-based models while trading off model accuracy and efficiency well.
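A schematic iRMB-style block based on the abstract's description: an inverted residual whose expanded features pass through a depthwise convolution (short-range, CNN-like) and lightweight self-attention (long-range, Transformer-like) before being projected back. This is one plausible reading of the block, not the official EMO code.

```python
import torch
import torch.nn as nn

class IRMBSketch(nn.Module):
    """Inverted residual with a local (depthwise conv) and a global (attention) branch."""

    def __init__(self, dim, expand_ratio=4, num_heads=4):
        super().__init__()
        hidden = dim * expand_ratio
        self.expand = nn.Conv2d(dim, hidden, kernel_size=1)
        self.dw = nn.Conv2d(hidden, hidden, kernel_size=3, padding=1, groups=hidden)
        self.attn = nn.MultiheadAttention(hidden, num_heads, batch_first=True)
        self.project = nn.Conv2d(hidden, dim, kernel_size=1)
        self.act = nn.SiLU()

    def forward(self, x):                        # x: (B, C, H, W)
        b, c, h, w = x.shape
        y = self.act(self.expand(x))
        local = self.dw(y)                       # short-distance dependency
        tokens = y.flatten(2).transpose(1, 2)    # (B, H*W, hidden)
        global_, _ = self.attn(tokens, tokens, tokens)
        global_ = global_.transpose(1, 2).reshape(b, -1, h, w)
        return x + self.project(self.act(local + global_))
```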