Automated audio captioning is a cross-modal translation task for describing the content of audio clips with natural language sentences. This task has attracted increasing attention, and substantial progress has been made in recent years. Captions generated by existing models are generally faithful to the content of audio clips. However, these machine-generated captions are often deterministic (e.g., generating a fixed caption for a given audio clip), simple (e.g., using common words and simple grammar), and generic (e.g., generating the same caption for similar audio clips). When people are asked to describe the content of an audio clip, different people tend to focus on different sound events and describe the clip from various aspects using distinct words and grammar. We believe that an audio captioning system should be able to generate diverse captions, either for a fixed audio clip or across similar audio clips. To this end, we propose an adversarial training framework based on a conditional generative adversarial network (C-GAN) to improve the diversity of audio captioning systems. A caption generator and two hybrid discriminators compete and are learned jointly, where the caption generator can be any standard encoder-decoder captioning model, and the hybrid discriminators assess the generated captions against different criteria, such as their naturalness and semantics. We conduct experiments on the Clotho dataset. The results show that our proposed model can generate captions with better diversity than state-of-the-art methods.
translated by 谷歌翻译
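This technical report describes the SurreyAudioTeam22's submission to DCASE 2022 ASC Task 1, low-complexity acoustic scene classification (ASC). The task has two rules: (a) the ASC framework should have at most 128K parameters, and (b) each inference should require at most 30 million multiply-accumulate operations (MACs). In this report, we present low-complexity systems for ASC that follow the rules of the task.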
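As a rough illustration of how the two task limits can be checked, the sketch below counts parameters and MACs for standard 2-D convolution layers. The three-layer network and its shapes are hypothetical and are not taken from the report; reading "128K" as 128,000 is also an assumption.

```python
def conv2d_cost(in_ch, out_ch, kernel, out_h, out_w, bias=True):
    """Return (parameters, MACs) for one standard 2-D convolution layer."""
    weights = in_ch * out_ch * kernel * kernel
    params = weights + (out_ch if bias else 0)
    # Each output element needs in_ch * kernel * kernel multiply-accumulates.
    macs = weights * out_h * out_w
    return params, macs


MAX_PARAMS = 128_000     # rule (a); assumes "128K" means 128,000
MAX_MACS = 30_000_000    # rule (b)

# Hypothetical network: (in_ch, out_ch, kernel, out_h, out_w) per layer.
layers = [(1, 16, 3, 40, 50), (16, 32, 3, 20, 25), (32, 64, 3, 10, 12)]

total_params = sum(conv2d_cost(*layer)[0] for layer in layers)
total_macs = sum(conv2d_cost(*layer)[1] for layer in layers)
within_budget = total_params <= MAX_PARAMS and total_macs <= MAX_MACS

print(total_params, total_macs, within_budget)
```

A real system would also count any dense, normalization, and pooling layers, which this sketch leaves out.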
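This paper presents a low-complexity framework for acoustic scene classification (ASC). Most frameworks designed for ASC use convolutional neural networks (CNNs) in preference to hand-crafted features. However, CNNs are resource-hungry because of their large size and high computational complexity, which makes them difficult to deploy on resource-constrained devices. This paper addresses the problem of reducing the computational complexity and memory requirements of CNNs. We propose a low-complexity CNN architecture and apply pruning and quantization to further reduce the number of parameters and the memory footprint. We then propose an ensemble framework that combines various low-complexity CNNs to improve overall performance. The proposed framework is evaluated experimentally on the publicly available DCASE 2022 Task 1, which focuses on ASC. The proposed ensemble framework has approximately 60K parameters, requires 19M multiply-accumulate operations, and improves performance by approximately 2-4 percentage points compared with the DCASE 2022 Task 1 baseline network.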
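Few-shot bioacoustic event detection is the task of detecting the occurrence time of a novel sound given only a few examples. Previous methods employ metric learning to build a latent space from the labeled parts of different sound classes, also known as positive events. In this study, we propose a segment-level few-shot learning framework that utilizes both positive and negative events during model optimization. Training with negative events, which are larger in volume than positive events, can improve the generalization ability of the model. In addition, we use transductive inference on the validation set during training to adapt better to novel classes. We conduct ablation studies on our proposed method with different setups for input features, training data, and hyperparameters. Our final system achieves an F-measure of 62.73 on the DCASE 2022 Challenge Task 5 (DCASE2022-T5) validation set, outperforming the baseline prototypical network's 34.02. Using the proposed method, our submitted system ranks second in DCASE2022-T5. The code for this paper is fully open-sourced at https://github.com/haoheliu/dcase_2022_task_5.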
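Continually learning new classes without catastrophic forgetting is a challenging problem given the restrictions on computational resources (e.g., model size, running memory). To address this issue, we propose a simple and efficient continual learning method. Our method selects the historical data used for training by measuring per-sample classification uncertainty. Specifically, we measure the uncertainty by observing how the classification probability of the data fluctuates in response to parallel perturbations added to the classifier embedding. In this way, the computational cost can be significantly reduced compared with adding perturbations to the raw data. Experimental results on the DCASE 2019 Task 1 and ESC-50 datasets show that our proposed method outperforms baseline continual learning methods in both classification accuracy and computational efficiency, indicating that our method can efficiently and incrementally learn new classes without the catastrophic forgetting problem for environmental sound classification.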
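The uncertainty measure described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the linear classifier, the Gaussian noise, the variance-based score, the choice to keep the most uncertain samples, and all function names are hypothetical. The perturbations are also applied in a loop here rather than in parallel as a batch.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(embedding, weights):
    """Assumed linear classifier head: one weight row per class."""
    logits = [sum(w * x for w, x in zip(row, embedding)) for row in weights]
    return softmax(logits)


def uncertainty(embedding, weights, n_perturb=64, sigma=0.5, seed=0):
    """Mean per-class variance of the class probabilities under Gaussian
    noise added to the embedding (not to the raw input)."""
    rng = random.Random(seed)
    probs = []
    for _ in range(n_perturb):
        noisy = [x + rng.gauss(0.0, sigma) for x in embedding]
        probs.append(classify(noisy, weights))
    var_sum = 0.0
    for c in range(len(weights)):
        column = [p[c] for p in probs]
        mean = sum(column) / len(column)
        var_sum += sum((v - mean) ** 2 for v in column) / len(column)
    return var_sum / len(weights)


def select_most_uncertain(embeddings, weights, k):
    """Indices of the k embeddings with the highest uncertainty
    (keeping the most uncertain samples is an assumption)."""
    ranked = sorted(range(len(embeddings)),
                    key=lambda i: uncertainty(embeddings[i], weights),
                    reverse=True)
    return ranked[:k]


# Toy two-class head: the decision boundary is at x[0] == 0.
toy_weights = [[1.0, 0.0], [-1.0, 0.0]]
toy_embeddings = [[5.0, 0.0], [0.0, 0.0], [-5.0, 0.0]]
selected = select_most_uncertain(toy_embeddings, toy_weights, k=1)
```

In the toy example, the embedding sitting on the decision boundary fluctuates the most under noise, so it is the one selected. Because only the classifier head is re-evaluated per perturbation, the encoder does not need to be run again.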
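Automated audio captioning is a cross-modal translation task that aims to generate natural language descriptions for given audio clips. This task has received increasing attention in recent years with the release of freely available datasets. The problem has been addressed predominantly with deep learning techniques. Numerous approaches have been proposed, such as investigating different neural network architectures, exploiting auxiliary information such as keywords or sentence information to guide caption generation, and employing different training strategies, which have greatly facilitated the development of this field. In this paper, we present a comprehensive review of the published contributions in automated audio captioning, from a variety of existing approaches to evaluation metrics and datasets. We also discuss open challenges and envisage possible future research directions.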
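Automated audio captioning (AAC) is a cross-modal translation task that aims to use natural language to describe the content of an audio clip. As shown by the submissions received for Task 6 of the DCASE 2021 Challenge, this problem has received increasing interest in the community. Existing AAC systems are usually based on an encoder-decoder architecture, where the audio signal is encoded into a latent representation and aligned with its corresponding text description, and a decoder is then used to generate the caption. However, training an AAC system often runs into the problem of data scarcity, which may lead to inaccurate representations and poor audio-text alignment. To address this problem, we propose a novel encoder-decoder framework called Contrastive Loss for Audio Captioning (CL4AC). In CL4AC, self-supervision signals derived from the original audio-text paired data are used to exploit the correspondences between audio and text by contrasting samples, which can improve the quality of the latent representation and the alignment between audio and text when training with limited data. Experiments are performed on the Clotho dataset to show the effectiveness of our proposed approach.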
Word Sense Disambiguation (WSD) is an NLP task aimed at determining the correct sense of a word in a sentence from discrete sense choices. Although current systems have attained unprecedented performance on such tasks, the nonuniform distribution of word senses during training generally results in systems performing poorly on rare senses. To this end, we consider data augmentation to increase the frequency of these least frequent senses (LFS) and thereby reduce the distributional bias of senses during training. We propose Sense-Maintained Sentence Mixup (SMSMix), a novel word-level mixup method that maintains the sense of a target word. SMSMix smoothly blends two sentences using mask prediction while preserving the relevant span determined by saliency scores to maintain a specific word's sense. To the best of our knowledge, this is the first attempt to apply mixup in NLP while preserving the meaning of a specific word. With extensive experiments, we validate that our augmentation method can effectively provide more information about rare senses during training while maintaining the target sense label.
Network intrusion detection systems (NIDS) for detecting malicious attacks continue to face challenges. NIDS are vulnerable to auto-generated port scan infiltration attempts, and they are often developed offline, resulting in a time lag before the spread of infiltration to other parts of a network can be prevented. To address these challenges, we use hypergraphs to capture evolving patterns of port scan attacks via the set of internet protocol addresses and destination ports, thereby deriving a set of hypergraph-based metrics to train a robust and resilient ensemble machine learning (ML) NIDS that effectively monitors and detects port scanning activities and adversarial intrusions while evolving intelligently in real time. Through the combination of (1) intrusion examples, (2) NIDS update rules, (3) attack threshold choices to trigger NIDS retraining requests, and (4) a production environment with no prior knowledge of the nature of network traffic, 40 scenarios were auto-generated to evaluate the ML ensemble NIDS comprising three tree-based models. Results show that under the model settings of an Update-ALL-NIDS rule (namely, retrain and update all three models upon the same NIDS retraining request), the proposed ML ensemble NIDS produced the best results, with nearly 100% detection performance throughout the simulation, exhibiting robustness in the complex dynamics of the simulated cyber-security scenario.
Human behavior emerges from planning over elaborate decompositions of tasks into goals, subgoals, and low-level actions. How are these decompositions created and used? Here, we propose and evaluate a normative framework for task decomposition based on the simple idea that people decompose tasks to reduce the overall cost of planning while maintaining task performance. Analyzing 11,117 distinct graph-structured planning tasks, we find that our framework justifies several existing heuristics for task decomposition and makes predictions that can be distinguished from two alternative normative accounts. We report a behavioral study of task decomposition ($N=806$) that uses 30 randomly sampled graphs, a larger and more diverse set than that of any previous behavioral study on this topic. We find that human responses are more consistent with our framework for task decomposition than alternative normative accounts and are most consistent with a heuristic -- betweenness centrality -- that is justified by our approach. Taken together, our results provide new theoretical insight into the computational principles underlying the intelligent structuring of goal-directed behavior.
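For readers unfamiliar with the betweenness centrality heuristic mentioned above, the following sketch computes it for a small undirected graph using Brandes' algorithm. The example graph (two triangles joined through a bridge node) is hypothetical and is not one of the graphs used in the study.

```python
from collections import deque


def betweenness_centrality(graph):
    """Unnormalized betweenness for an undirected, unweighted graph
    given as {node: [neighbors]} (Brandes' algorithm)."""
    score = {v: 0.0 for v in graph}
    for source in graph:
        # Breadth-first search that counts shortest paths from source.
        order = []
        preds = {v: [] for v in graph}
        n_paths = {v: 0 for v in graph}
        dist = {v: -1 for v in graph}
        n_paths[source] = 1
        dist[source] = 0
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in graph[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    n_paths[w] += n_paths[v]
                    preds[w].append(v)
        # Accumulate pair dependencies from the farthest nodes inward.
        dependency = {v: 0.0 for v in graph}
        for w in reversed(order):
            for v in preds[w]:
                dependency[v] += n_paths[v] / n_paths[w] * (1.0 + dependency[w])
            if w != source:
                score[w] += dependency[w]
    # Each unordered pair was counted once from each endpoint.
    return {v: s / 2.0 for v, s in score.items()}


# Hypothetical task graph: triangle a-b-c, bridge node d, triangle e-f-g.
toy_graph = {
    "a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b", "d"],
    "d": ["c", "e"],
    "e": ["d", "f", "g"], "f": ["e", "g"], "g": ["e", "f"],
}
centrality = betweenness_centrality(toy_graph)
best_subgoal = max(centrality, key=centrality.get)
```

In this toy graph, every shortest path between the two triangles passes through the bridge node `d`, so it receives the highest score and would be the natural subgoal under this heuristic.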