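This overview paper describes the first shared task on fake news detection in the Urdu language. The task was posed as a binary classification task, in which the goal is to differentiate between real news and fake news. We provided a dataset divided into 900 annotated news articles for training and 400 news articles for testing. The dataset contains news in five domains: (i) Health, (ii) Sports, (iii) Showbiz, (iv) Technology, and (v) Business. Forty-two teams from 6 different countries (India, China, Egypt, Germany, Pakistan, and the UK) registered for the task, and 9 teams submitted their experimental results. The participants used various machine learning methods, ranging from feature-based traditional machine learning to neural network techniques. The best performing system achieved an F-score of 0.90, showing that the BERT-based approach outperforms other machine learning techniques.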
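As the influence of social media platforms grows, the effect of their misuse becomes more and more impactful. The importance of automatically detecting threatening and abusive language cannot be overestimated. However, most existing studies and state-of-the-art methods target English, with limited work on low-resource languages. In this paper, we present two shared tasks on abusive and threatening language detection for Urdu, which has more than 170 million speakers worldwide. Both are posed as binary classification tasks in which participating systems are required to classify tweets in Urdu into two classes, namely: (i) Abusive and Non-Abusive for the first task, and (ii) Threatening and Non-Threatening for the second. We present two manually annotated datasets containing tweets labelled as (i) Abusive and Non-Abusive, and (ii) Threatening and Non-Threatening. The abusive dataset contains 2400 annotated tweets in the train part and 1100 annotated tweets in the test part. The threatening dataset contains 6000 annotated tweets in the train part and 3950 annotated tweets in the test part. We also provide logistic regression and BERT-based baseline classifiers for both tasks. In this shared task, 21 teams from six countries (India, Pakistan, China, Malaysia, United Arab Emirates, and Taiwan) registered for participation; 10 teams submitted runs for Subtask A (abusive language detection), 9 teams submitted runs for Subtask B (threatening language detection), and seven teams submitted technical reports. The best performing system achieved an F1-score of 0.880 for Subtask A and 0.545 for Subtask B. For both subtasks, an m-BERT based transformer model showed the best performance.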
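This study reports the second shared task, named UrduFake@FIRE2021, on fake news detection in the Urdu language. This is a binary classification problem in which the task is to classify a given news article into one of two classes: (i) real news, or (ii) fake news. Thirty-four teams from 7 different countries (China, Egypt, Israel, India, Mexico, Pakistan, and UAE) registered to participate in the shared task, 18 teams submitted their experimental results, and 11 teams submitted their technical reports. The proposed systems were based on various count-based features and used different classifiers as well as neural network architectures. The stochastic gradient descent (SGD) algorithm outperformed the other classifiers and achieved an F-score of 0.679.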
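Automatic detection of fake news is a highly important task in the contemporary world. This study reports the second shared task, called UrduFake@FIRE2021, on fake news detection in Urdu. The goal of the shared task is to motivate the community to come up with efficient methods for solving this vital problem, particularly for the Urdu language. The task is posed as a binary classification problem: label a given news article as a real or a fake news article. The organizers provide a dataset comprising news in five domains: (i) Health, (ii) Sports, (iii) Showbiz, (iv) Technology, and (v) Business, split into training and testing sets. The training set contains 1300 annotated news articles (750 real, 550 fake), while the testing set contains 300 news articles (200 real, 100 fake). Thirty-four teams from 7 different countries (China, Egypt, Israel, India, Mexico, Pakistan, and UAE) registered to participate in the UrduFake@FIRE2021 shared task. Of those, 18 teams submitted their experimental results and 11 of those submitted technical reports, which is substantially higher than in the UrduFake shared task in 2020, when only 6 teams submitted technical reports. The technical reports submitted by the participants demonstrated different data representation techniques, ranging from count-based BoW features to word vector embeddings, as well as the use of numerous machine learning algorithms, ranging from traditional SVM to various neural network architectures, including Transformers such as BERT and RoBERTa. In this year's competition, the best performing system obtained an F1-macro score of 0.679, which is lower than the past year's best result of 0.907 F1-macro. Admittedly, while the training sets from the past and the current years overlap to a large extent, the testing set provided this year is completely different.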
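As a hedged illustration of the F1-macro metric reported in the abstract above, the following sketch computes the per-class F1 for a binary real/fake labeling and averages the two values without weighting. The function names and the toy labels are assumptions made for illustration; this is not the organizers' evaluation script.

```python
def f1_for_class(y_true, y_pred, positive):
    """F1 score obtained when `positive` is treated as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        # No true positives: precision or recall is zero, so F1 is zero.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(y_true, y_pred, labels=("real", "fake")):
    """Unweighted mean of the per-class F1 scores."""
    return sum(f1_for_class(y_true, y_pred, c) for c in labels) / len(labels)


# Toy example with made-up labels (not task data).
y_true = ["real", "real", "real", "fake", "fake", "real"]
y_pred = ["real", "real", "fake", "fake", "real", "real"]
print(macro_f1(y_true, y_pred))
```

Because the average is unweighted, the minority class (here, fake news) counts as much as the majority class, which matters for an imbalanced test set such as the 200 real / 100 fake split described above.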
Masked Language Modeling (MLM) has proven to be an essential component of Vision-Language (VL) pretraining. To implement MLM, the researcher must make two design choices: the masking strategy, which determines which tokens to mask, and the masking rate, which determines how many tokens to mask. Previous work has focused primarily on the masking strategy while setting the masking rate at a default of 15%. In this paper, we show that increasing this masking rate improves downstream performance while simultaneously reducing the performance gap among different masking strategies, rendering the uniform masking strategy competitive with other, more complex ones. Surprisingly, we also discover that increasing the masking rate leads to gains in Image-Text Matching (ITM) tasks, suggesting that the role of MLM goes beyond language modeling in VL pretraining.
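To make the two design choices in the abstract above concrete, here is a minimal sketch of a uniform masking strategy with a configurable masking rate. The function name `uniform_mask`, the `[MASK]` placeholder, and the example sentence are assumptions for illustration. The sketch always substitutes the mask token, which is a simplification and not necessarily the procedure used in the paper.

```python
import random


def uniform_mask(tokens, masking_rate=0.15, mask_token="[MASK]", seed=None):
    """Mask a fixed fraction of positions chosen uniformly at random.

    Returns the masked token list and the sorted list of masked positions.
    """
    if not 0.0 <= masking_rate <= 1.0:
        raise ValueError("masking_rate must be in [0, 1]")
    rng = random.Random(seed)
    if tokens and masking_rate > 0:
        # Mask at least one token whenever the rate is positive.
        n_mask = max(1, round(len(tokens) * masking_rate))
    else:
        n_mask = 0
    positions = sorted(rng.sample(range(len(tokens)), n_mask))
    masked = list(tokens)
    for i in positions:
        masked[i] = mask_token
    return masked, positions


caption = "a dog is running on the grass near the old red barn".split()
low_masked, low_pos = uniform_mask(caption, masking_rate=0.15, seed=0)
high_masked, high_pos = uniform_mask(caption, masking_rate=0.60, seed=0)
print(len(low_pos), len(high_pos))
```

The strategy (which positions) and the rate (how many positions) are separate parameters here, so a different strategy could be swapped in by changing only how `positions` is chosen.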
Owing to the success of transformer models, recent works study their applicability in 3D medical segmentation tasks. Within the transformer models, the self-attention mechanism is one of the main building blocks that strives to capture long-range dependencies, compared to the local convolutional-based design. However, the self-attention operation has quadratic complexity, which proves to be a computational bottleneck, especially in volumetric medical imaging, where the inputs are 3D with numerous slices. In this paper, we propose a 3D medical image segmentation approach, named UNETR++, that offers both high-quality segmentation masks and efficiency in terms of parameters and compute cost. The core of our design is the introduction of a novel efficient paired attention (EPA) block that efficiently learns spatial and channel-wise discriminative features using a pair of inter-dependent branches based on spatial and channel attention. Our spatial attention formulation is efficient, having linear complexity with respect to the input sequence length. To enable communication between the spatial and channel-focused branches, we share the weights of the query and key mapping functions, which provides a complementary benefit (paired attention) while also reducing the overall network parameters. Our extensive evaluations on three benchmarks, Synapse, BTCV and ACDC, reveal the effectiveness of the proposed contributions in terms of both efficiency and accuracy. On the Synapse dataset, our UNETR++ sets a new state-of-the-art with a Dice Similarity Score of 87.2%, while being significantly more efficient, with a reduction of over 71% in terms of both parameters and FLOPs compared to the best existing method in the literature. Code: https://github.com/Amshaker/unetr_plus_plus.
Large-scale multi-modal training with image-text pairs imparts strong generalization to the CLIP model. Since training on a similar scale for videos is infeasible, recent approaches focus on the effective transfer of image-based CLIP to the video domain. In this pursuit, new parametric modules are added to learn temporal information and inter-frame relationships, which require meticulous design efforts. Furthermore, when the resulting models are learned on videos, they tend to overfit on the given task distribution and lack generalization. This raises the following question: how can image-level CLIP representations be effectively transferred to videos? In this work, we show that a simple Video Fine-tuned CLIP (ViFi-CLIP) baseline is generally sufficient to bridge the domain gap from images to videos. Our qualitative analysis illustrates that frame-level processing by the CLIP image encoder, followed by feature pooling and similarity matching with the corresponding text embeddings, helps in implicitly modeling the temporal cues within ViFi-CLIP. Such fine-tuning helps the model to focus on scene dynamics, moving objects and inter-object relationships. For low-data regimes where full fine-tuning is not viable, we propose a "bridge and prompt" approach that first uses fine-tuning to bridge the domain gap and then learns prompts on the language and vision sides to adapt CLIP representations. We extensively evaluate this simple yet strong baseline in zero-shot, base-to-novel generalization, few-shot and fully supervised settings across five video benchmarks. Our code is available at https://github.com/muzairkhattak/ViFi-CLIP.
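The pipeline summarized in the abstract above (per-frame embeddings, feature pooling, then similarity matching against text embeddings) can be sketched roughly as follows. This is an illustrative sketch only: the embeddings are hand-made stand-ins, not CLIP encoder outputs; average pooling is an assumption, since the abstract does not name the pooling operator; and the function names are hypothetical.

```python
import math


def mean_pool(frame_embeddings):
    """Average per-frame embeddings into a single video-level embedding."""
    n = len(frame_embeddings)
    dim = len(frame_embeddings[0])
    return [sum(frame[d] for frame in frame_embeddings) / n for d in range(dim)]


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def classify_video(frame_embeddings, text_embeddings):
    """Pick the label whose text embedding best matches the pooled video embedding."""
    video = mean_pool(frame_embeddings)
    scores = {label: cosine(video, emb) for label, emb in text_embeddings.items()}
    best = max(scores, key=scores.get)
    return best, scores


# Stand-in embeddings (assumed values; a real system would use encoder outputs).
frames = [[1.0, 0.0, 0.2], [0.8, 0.1, 0.0], [0.9, 0.0, 0.1]]
texts = {"running": [1.0, 0.0, 0.0], "cooking": [0.0, 1.0, 0.0]}
label, scores = classify_video(frames, texts)
print(label)
```

No temporal module appears anywhere in this sketch; each frame is embedded independently and the frames interact only through the pooling step.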
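Over the past decade, we have seen tremendous growth in industrial data and computing power, as well as major theoretical advances in machine learning. This opens up an opportunity to use modern machine learning tools on large-scale nonlinear monitoring and control problems. This article provides a survey of recent results with applications in the process industry.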
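We present Covy, a robotic platform that promotes social distancing during pandemics such as COVID-19. Covy features a novel compound vision system that enables it to detect social distancing breaches at distances of up to 16 m. Covy navigates its surroundings autonomously using a hybrid navigation stack that combines deep reinforcement learning (DRL) with a probabilistic localization method. We built the complete system and evaluated Covy's performance through extensive experiments in both simulated and realistic environments. Among other findings, our results show that the hybrid navigation stack is more robust than a pure DRL-based solution.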
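Existing open-vocabulary object detectors typically enlarge their vocabulary sizes by leveraging different forms of weak supervision. This helps them generalize to novel objects at inference. Two popular forms of weak supervision used in open-vocabulary detection (OVD) are the pretrained CLIP model and image-level supervision. We note that neither of these modes of supervision is optimally aligned with the detection task: CLIP is trained with image-text pairs and lacks precise localization of objects, while image-level supervision has been used with heuristics that do not accurately specify local object regions. In this work, we propose to address this problem by performing object-centric alignment of the language embeddings from the CLIP model. Furthermore, we visually ground the objects with only image-level supervision using a pseudo-labeling process that provides high-quality object proposals and helps expand the vocabulary during training. We establish a bridge between the above two object-alignment strategies via a novel weight transfer function that aggregates their complementary strengths. In essence, the proposed model seeks to minimize the gap between object-centric and image-centric representations in the OVD setting. On the COCO benchmark, our proposed approach achieves 40.3 AP50 on novel classes, an absolute gain of 11.9 over the previous best performance. For LVIS, we surpass the state-of-the-art ViLD model by 5.0 mask AP for rare categories and by 3.4 overall. Code: https://bit.ly/3byzoqp.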