Eliminating the adverse effects of non-stationary environmental noise is a fundamental research topic in automatic speech recognition and remains an important challenge. Data-driven supervised approaches, including those based on deep neural networks, have recently emerged as potential alternatives to traditional unsupervised approaches, and with sufficient training they can alleviate the shortcomings of unsupervised methods in various real-world acoustic environments. In light of this, we review recently developed, representative deep learning approaches for tackling non-stationary additive and convolutional degradation of speech, with the aim of providing guidance for those involved in developing environmentally robust speech recognition systems. We separately discuss single- and multi-channel techniques developed for the front end and the back end of speech recognition systems, as well as joint front-end and back-end training frameworks.
Acoustic data provide scientific and engineering insights in fields ranging from biology and communications to ocean and Earth science. We survey the advances and transformative potential of machine learning (ML), including deep learning, in the field of acoustics. ML is a broad family of statistical techniques for automatically detecting and exploiting patterns in data. Relative to conventional acoustics and signal processing, ML is data-driven. Given sufficient training data, ML can discover complex relationships between features. With large amounts of training data, ML can discover models describing complex acoustic phenomena such as human speech and reverberation. ML in acoustics is rapidly developing, with compelling results and significant promise for the future. We first introduce ML, then highlight ML developments in five acoustics research areas: source localization in speech processing, source localization in ocean acoustics, bioacoustics, seismic detection, and environmental sounds in everyday scenes.
In this invited paper, my overview material on the same topic as presented in the plenary overview session of APSIPA-2011 and the tutorial material presented in the same conference [1] are expanded and updated to include more recent developments in deep learning. The previous and the updated materials cover both theory and applications, and analyze its future directions. The goal of this tutorial survey is to introduce the emerging area of deep learning or hierarchical learning to the APSIPA community. Deep learning refers to a class of machine learning techniques, developed largely since 2006, where many stages of non-linear information processing in hierarchical architectures are exploited for pattern classification and for feature learning. In the more recent literature, it is also connected to representation learning, which involves a hierarchy of features or concepts where higher-level concepts are defined from lower-level ones and where the same lower-level concepts help to define higher-level ones. In this tutorial survey, a brief history of deep learning research is discussed first. Then, a classificatory scheme is developed to analyze and summarize major work reported in the recent deep learning literature. Using this scheme, I provide a taxonomy-oriented survey on the existing deep architectures and algorithms in the literature, and categorize them into three classes: generative, discriminative, and hybrid. Three representative deep architectures-deep autoencoders, deep stacking networks with their generalization to the temporal domain (recurrent networks), and deep neural networks (pretrained with deep belief networks)-one in each of the three classes, are presented in more detail. Next, selected applications of deep learning are reviewed in broad areas of signal and information processing including audio/speech, image/vision, multimodality, language modeling, natural language processing, and information retrieval. 
Finally, future directions of deep learning are discussed and analyzed.
Sound events often occur in unstructured environments where they exhibit wide variations in their frequency content and temporal structure. Convolutional neural networks (CNNs) are able to extract higher-level features that are invariant to local spectral and temporal variations. Recurrent neural networks (RNNs) are powerful in learning the longer-term temporal context in audio signals. CNNs and RNNs as classifiers have recently shown improved performance over established methods in various sound recognition tasks. We combine these two approaches in a Convolutional Recurrent Neural Network (CRNN) and apply it to a polyphonic sound event detection task. We compare the performance of the proposed CRNN method with CNN, RNN, and other established methods, and observe a considerable improvement on four different datasets consisting of everyday sound events.
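As a rough illustration of the CRNN pipeline described above, the following numpy sketch chains an untrained convolutional stage (local spectro-temporal features) with a simple recurrent stage (longer-term context) and per-frame sigmoid outputs, which allow several overlapping events per frame as polyphonic detection requires. All layer sizes and names here are illustrative assumptions, not the paper's actual configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

def conv1d(x, w, b):
    """Valid 1-D convolution over time with ReLU: x is (T, F), w is (K, F, C)."""
    T, F = x.shape
    K, _, C = w.shape
    out = np.empty((T - K + 1, C))
    for t in range(T - K + 1):
        out[t] = np.tensordot(x[t:t + K], w, axes=([0, 1], [0, 1])) + b
    return np.maximum(out, 0.0)

def rnn(x, wx, wh, bh):
    """Simple tanh recurrence over the conv features."""
    h = np.zeros(wh.shape[0])
    hs = []
    for t in range(x.shape[0]):
        h = np.tanh(x[t] @ wx + h @ wh + bh)
        hs.append(h)
    return np.stack(hs)

def crnn_forward(x, params):
    h = conv1d(x, params["wc"], params["bc"])              # local spectro-temporal features
    h = rnn(h, params["wx"], params["wh"], params["bh"])   # longer-term temporal context
    logits = h @ params["wo"] + params["bo"]
    return 1.0 / (1.0 + np.exp(-logits))                   # per-frame, per-event probabilities

T, F, K, C, H, E = 20, 8, 3, 4, 6, 3   # frames, freq bins, kernel, channels, hidden, event classes
params = {
    "wc": rng.standard_normal((K, F, C)) * 0.1, "bc": np.zeros(C),
    "wx": rng.standard_normal((C, H)) * 0.1, "wh": rng.standard_normal((H, H)) * 0.1,
    "bh": np.zeros(H),
    "wo": rng.standard_normal((H, E)) * 0.1, "bo": np.zeros(E),
}
probs = crnn_forward(rng.standard_normal((T, F)), params)
print(probs.shape)  # (18, 3): one score per frame and per event class
```

A trained model would threshold these per-frame probabilities to obtain event activity; using independent sigmoids rather than a softmax is what makes multiple simultaneous events representable.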
Since the proposal of a fast learning algorithm for deep belief networks in 2006, deep learning techniques have drawn ever-increasing research interest because of their inherent capability of overcoming the drawbacks of traditional algorithms that depend on hand-designed features. Deep learning approaches have also been found to be suitable for big data analysis, with successful applications to computer vision, pattern recognition, speech recognition, natural language processing, and recommendation systems. In this paper, we discuss some widely used deep learning architectures and their practical applications. An up-to-date overview is provided of four deep learning architectures, namely the autoencoder, convolutional neural network, deep belief network, and restricted Boltzmann machine. Different types of deep neural networks are surveyed and recent progress is summarized. Applications of deep learning techniques in selected areas (speech recognition, pattern recognition, and computer vision) are highlighted. A list of future research topics is finally given with clear justifications.
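Of the four architectures surveyed, the autoencoder is the easiest to show end to end: a bottleneck network trained to reconstruct its own input. Below is a minimal linear autoencoder trained by plain gradient descent on toy numpy data; the sizes, learning rate, and variable names are illustrative assumptions, not anything from the surveyed work.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy data: 200 samples lying near a 2-D subspace of a 6-D space.
Z = rng.standard_normal((200, 2))
X = Z @ rng.standard_normal((2, 6)) + 0.01 * rng.standard_normal((200, 6))
mse_baseline = float(np.mean(X ** 2))   # error of the trivial all-zero reconstruction

# Single-hidden-layer linear autoencoder trained with plain gradient descent.
W_enc = rng.standard_normal((6, 2)) * 0.1
W_dec = rng.standard_normal((2, 6)) * 0.1
lr = 0.01
for _ in range(2000):
    H = X @ W_enc                       # encode: 6-D input -> 2-D code
    X_hat = H @ W_dec                   # decode: 2-D code -> 6-D reconstruction
    err = X_hat - X                     # reconstruction error drives learning
    W_dec -= lr * H.T @ err / len(X)
    W_enc -= lr * X.T @ (err @ W_dec.T) / len(X)

mse = float(np.mean((X @ W_enc @ W_dec - X) ** 2))
print(mse)  # reconstruction error after training
```

The 2-D bottleneck forces the network to learn a compressed code; a deep autoencoder stacks several such encode/decode layers with non-linearities in between.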
New waves of consumer-centric applications, such as voice search and voice interaction with mobile devices and home entertainment systems, increasingly require automatic speech recognition (ASR) to be robust to the full range of real-world noise and other acoustic distorting conditions. Despite its practical importance, however, the inherent links between and distinctions among the myriad of methods for noise-robust ASR have yet to be carefully studied in order to advance the field further. To this end, it is critical to establish a solid, consistent, and common mathematical foundation for noise-robust ASR, which is lacking at present. This article is intended to fill this gap and to provide a thorough overview of modern noise-robust techniques for ASR developed over the past 30 years. We emphasize methods that are proven to be successful and that are likely to sustain or expand their future applicability. We distill key insights from our comprehensive overview in this field and take a fresh look at a few old problems, which nevertheless are still highly relevant today. Specifically, we have analyzed and categorized a wide range of noise-robust techniques using five different criteria: 1) feature-domain vs. model-domain processing, 2) the use of prior knowledge about the acoustic environment distortion, 3) the use of explicit environment-distortion models, 4) deterministic vs. uncertainty processing, and 5) the use of acoustic models trained jointly with the same feature enhancement or model adaptation process used in the testing stage. With this taxonomy-oriented review, we equip the reader with the insight to choose among techniques and with the awareness of the performance-complexity tradeoffs. The pros and cons of using different noise-robust ASR techniques in practical application scenarios are provided as a guide to interested practitioners. The current challenges and future research directions in this field are also carefully analyzed.
Automatic Speech Recognition (ASR) has historically been a driving force behind many machine learning (ML) techniques, including the ubiquitously used hidden Markov model, discriminative learning, structured sequence learning, Bayesian learning, and adaptive learning. Moreover, ML can and occasionally does use ASR as a large-scale, realistic application to rigorously test the effectiveness of a given technique, and to inspire new problems arising from the inherently sequential and dynamic nature of speech. On the other hand, even though ASR is available commercially for some applications, it is largely an unsolved problem-for almost all applications, the performance of ASR is not on par with human performance. New insight from modern ML methodology shows great promise to advance the state-of-the-art in ASR technology. This overview article provides readers with an overview of modern ML techniques as utilized in the current and as relevant to future ASR research and systems. The intent is to foster further cross-pollination between the ML and ASR communities than has occurred in the past. The article is organized according to the major ML paradigms that are either popular already or have potential for making significant contributions to ASR technology. The paradigms presented and elaborated in this overview include: generative and discriminative learning; supervised, unsupervised, semi-supervised, and active learning; adaptive and multi-task learning; and Bayesian learning. These learning paradigms are motivated and discussed in the context of ASR technology and applications. We finally present and analyze recent developments of deep learning and learning with sparse representations, focusing on their direct relevance to advancing ASR technology.
Multimodal learning has long lacked principled approaches for combining information from different modalities and learning meaningful low-dimensional representations. We study multimodal learning and sensor fusion from a latent-variable perspective. We first propose a regularized recurrent attention filter for sensor fusion. The algorithm can dynamically combine information from different types of sensors in sequential decision-making tasks. Each sensor is coupled with a modular neural network to maximize the utility of its own information. A gating modular neural network dynamically generates a set of mixing weights for the outputs of the sensor networks by balancing the utility of all sensors' information. We design a co-learning mechanism to encourage simultaneous adaptive and independent learning for each sensor, and propose a regularization-based co-learning method. In the second part, we focus on recovering the diversity of latent representations. We propose a co-learning approach using probabilistic graphical models, which imposes a structural prior in the generative model: the multimodal variational RNN (MVRNN) model, and derive a variational lower bound on its objective function. In the third part, we extend the Siamese structure to sensor fusion for robust acoustic event detection. We conduct experiments to study the extracted latent representations; the work will be completed in the coming months. Our experiments show that the recurrent attention filter can dynamically combine different sensor inputs according to the information they carry. We believe the MVRNN can identify latent representations that are useful for many downstream tasks, such as speech synthesis, activity recognition, and control and planning. Both algorithms are general frameworks that can be applied to other tasks in which different types of sensors are jointly used for decision making.
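The gating idea described above (a module that scores each sensor's current features and produces a set of mixing weights over the sensor networks' outputs) can be reduced to a few lines of numpy. The function and parameter names and the fixed two-sensor setup are illustrative assumptions, not the thesis's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(2)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def fuse(sensor_feats, gate_w, gate_b):
    """Gating network scores each sensor's current features, then mixes them."""
    scores = np.array([f @ w + b
                       for f, (w, b) in zip(sensor_feats, zip(gate_w, gate_b))])
    alpha = softmax(scores)                       # one weight per sensor, summing to 1
    fused = sum(a * f for a, f in zip(alpha, sensor_feats))
    return fused, alpha

D = 4                                             # feature dimension of each sensor net
gate_w = [rng.standard_normal(D) for _ in range(2)]
gate_b = [0.0, 0.0]
feats = [rng.standard_normal(D), rng.standard_normal(D)]  # per-sensor network outputs
fused, alpha = fuse(feats, gate_w, gate_b)
print(alpha)  # mixing weights over the two sensors
```

Because the weights are recomputed from the current features at every step, a sequence of such calls dynamically re-balances the sensors as their informativeness changes over time.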
Significant progress has been made in the field of presentation attack detection (PAD) for automatic speaker verification (ASV) over the past few years. This includes the development of new speech corpora and standard evaluation protocols, as well as advances in front-end feature extraction and back-end classifiers. The use of standard databases and evaluation protocols has, for the first time, enabled meaningful benchmarking of different PAD solutions. This chapter summarizes that progress, with a focus on research completed in the last three years. It outlines the results and lessons learned from the two ASVspoof challenges, the first community-led benchmarking efforts. These show that ASV PAD remains an unsolved problem, and that further attention is needed to develop generalized PAD solutions with the potential to detect diverse and previously unseen spoofing attacks.
This paper proposes a deep speech enhancement method that exploits the high potential of residual connections in a wide neural network architecture, called a wide residual network. It builds on one-dimensional convolutions computed along the time axis, which is a powerful way of processing context-dependent representations such as sequences of speech features. We find the residual mechanism extremely useful for the enhancement task, since the signal always has a linear shortcut and the non-linear path enhances it step by step by adding or subtracting corrections. The enhancement capability of the proposal is assessed by objective quality metrics and by the performance of a speech recognition system. Evaluation is carried out within the framework of the REVERB Challenge dataset, which includes simulated and real samples of reverberant and noisy speech signals. The results show that the method successfully enhances speech, in terms of both intelligibility-oriented enhancement and speech recognition performance. A DNN model trained with artificially synthesized reverberant data is able to handle far-field reverberant speech from real scenarios. In addition, the method is able to take advantage of the residual connections to enhance signals with low noise, which is usually a strong obstacle for traditional enhancement methods.
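The residual mechanism the abstract describes (a linear shortcut for the signal plus a non-linear path that adds or subtracts corrections) can be sketched in numpy with one-dimensional convolutions along time. The shapes, kernel size, and names below are illustrative assumptions, not the paper's actual wide residual network.

```python
import numpy as np

rng = np.random.default_rng(3)

def conv1d_same(x, w):
    """'Same'-padded 1-D convolution along time; x is (T, F), w is (K, F, F)."""
    K = w.shape[0]
    pad = K // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))
    return np.stack([np.tensordot(xp[t:t + K], w, axes=([0, 1], [0, 1]))
                     for t in range(x.shape[0])])

def residual_block(x, w1, w2):
    """y = x + F(x): the linear shortcut carries the signal through untouched,
    while the non-linear path contributes a learned correction."""
    h = np.maximum(conv1d_same(x, w1), 0.0)   # ReLU non-linearity
    return x + conv1d_same(h, w2)

T, F, K = 50, 8, 3
x = rng.standard_normal((T, F))               # e.g. a sequence of speech features
w1 = rng.standard_normal((K, F, F)) * 0.05
w2 = rng.standard_normal((K, F, F)) * 0.05
y = residual_block(x, w1, w2)
print(y.shape)  # time resolution preserved: (50, 8)
```

With small weights the output stays close to the input, which is exactly why such blocks do little harm to already-clean (low-noise) signals: the identity path dominates unless the learned correction is needed.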
This paper introduces WaveNet, a deep neural network for generating raw audio waveforms. The model is fully probabilistic and autoregressive, with the predictive distribution for each audio sample conditioned on all previous ones; nonetheless, we show that it can be efficiently trained on data with tens of thousands of audio samples per second. When applied to text-to-speech, it yields state-of-the-art performance, with human listeners rating it as significantly more natural sounding than the best parametric and concatenative systems for both English and Mandarin. A single WaveNet can capture the characteristics of many different speakers with equal fidelity, and can switch between them by conditioning on the speaker identity. When trained to model music, we find that it generates novel and often highly realistic musical fragments. We also show that it can be employed as a discriminative model, returning promising results for phoneme recognition.
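The core autoregressive loop (each sample drawn from a predictive distribution conditioned on previous samples, then fed back as input) can be sketched as follows. This is a deliberately tiny stand-in: a fixed linear predictor over a short context window replaces WaveNet's deep network, and all names and sizes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)

RECEPTIVE_FIELD = 16                              # how many past samples condition each new one
w = rng.standard_normal(RECEPTIVE_FIELD) * 0.1    # toy stand-in for the trained network

def generate(n_samples):
    audio = list(np.zeros(RECEPTIVE_FIELD))       # zero "priming" context
    for _ in range(n_samples):
        context = np.array(audio[-RECEPTIVE_FIELD:])
        mean = context @ w                        # predictive distribution's mean
        audio.append(rng.normal(mean, 0.1))       # sample the next value, feed it back
    return np.array(audio[RECEPTIVE_FIELD:])

x = generate(100)
print(x.shape)  # (100,)
```

The feed-back step is what makes generation strictly sequential, one sample at a time, and hence why generating raw audio at tens of thousands of samples per second is the model's central engineering challenge.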
Over the past few years, the field of natural language processing has been propelled forward by the explosive use of deep learning models. This survey briefly introduces the field and gives a quick overview of deep learning architectures and methods. It then sifts through the large volume of recent research and summarizes a broad range of relevant contributions. The analyzed research areas include several core linguistic processing problems as well as many applications of computational linguistics. A discussion of the state of the art is then provided, along with recommendations for future research in the field.
We present a supervised neural network model for polyphonic piano music transcription. The architecture of the proposed model is analogous to speech recognition systems and comprises an acoustic model and a music language model. The acoustic model is a neural network used for estimating the probabilities of pitches in a frame of audio. The language model is a recurrent neural network that models the correlations between pitch combinations over time. The proposed model is general and can be used to transcribe polyphonic music without imposing any constraints on the polyphony. The acoustic and language model predictions are combined using a probabilistic graphical model. Inference over the output variables is performed using the beam search algorithm. We perform two sets of experiments. We investigate various neural network architectures for the acoustic models and also investigate the effect of combining acoustic and music language model predictions using the proposed architecture. We compare the performance of the neural network based acoustic models with two popular unsupervised acoustic models. Results show that convolutional neural network acoustic models yield the best performance across all evaluation metrics. We also observe improved performance with the application of the music language models. Finally, we present an efficient variant of beam search that improves performance and reduces run-times by an order of magnitude, making the model suitable for real-time applications.
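The decoding step described above (combining per-frame acoustic probabilities with a language model via beam search) can be illustrated with a toy monophonic version in numpy, using a simple bigram transition matrix in place of the RNN language model. All names and sizes are illustrative assumptions, not the paper's actual model.

```python
import numpy as np

def beam_search(acoustic, transition, beam_width=3):
    """Frame-level beam search: score(path) = sum over frames of
    log P_acoustic(pitch | frame) + log P_language(pitch | previous pitch)."""
    T, P = acoustic.shape
    beams = [([], 0.0)]                           # (pitch sequence, log score)
    for t in range(T):
        candidates = []
        for seq, score in beams:
            for p in range(P):
                lm = transition[seq[-1], p] if seq else 1.0 / P  # uniform prior at t=0
                candidates.append((seq + [p],
                                   score + np.log(acoustic[t, p]) + np.log(lm)))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]           # keep only the best hypotheses
    return beams[0][0]

rng = np.random.default_rng(5)
T, P = 6, 4                                       # frames, pitch classes (toy sizes)
acoustic = rng.dirichlet(np.ones(P), size=T)      # per-frame pitch posteriors
transition = rng.dirichlet(np.ones(P), size=P)    # bigram "language model" stand-in
best = beam_search(acoustic, transition)
print(best)  # highest-scoring pitch sequence under the combined score
```

Pruning to a fixed beam width is what keeps the search tractable; the polyphonic case in the paper additionally searches over sets of simultaneous pitches per frame rather than a single pitch.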
Our experience of the world is multimodal - we see objects, hear sounds, feel texture, smell odors, and taste flavors. Modality refers to the way in which something happens or is experienced, and a research problem is characterized as multimodal when it includes multiple such modalities. In order for Artificial Intelligence to make progress in understanding the world around us, it needs to be able to interpret such multimodal signals together. Multimodal machine learning aims to build models that can process and relate information from multiple modalities. It is a vibrant multi-disciplinary field of increasing importance and with extraordinary potential. Instead of focusing on specific multimodal applications, this paper surveys the recent advances in multimodal machine learning itself and presents them in a common taxonomy. We go beyond the typical early and late fusion categorization and identify broader challenges that are faced by multimodal machine learning, namely: representation, translation, alignment, fusion, and co-learning. This new taxonomy will enable researchers to better understand the state of the field and identify directions for future research.
Automatic human affect recognition is a key step towards more natural human-computer interaction. Recent trends include recognition in the wild using a fusion of audiovisual and physiological sensors, a challenging setting for traditional machine learning algorithms. Since 2010, novel deep learning algorithms have been increasingly applied in this field. In this paper, the literature on human affect recognition between 2010 and 2017 is analyzed, with a special focus on approaches using deep neural networks. By classifying 950 studies according to their use of shallow or deep architectures, we are able to show a trend towards deep learning. Reviewing a subset of 233 studies that employ deep neural networks, we comprehensively quantify their applications in this field. We find that deep learning is used for learning (i) spatial feature representations, (ii) temporal feature representations, and (iii) joint feature representations for multimodal sensor data. Exemplary state-of-the-art architectures illustrate the progress. Our findings show the role deep architectures will play in human affect recognition and can serve as a reference point for researchers working on related applications.
In the era of the Internet of Things (IoT), an enormous amount of sensing devices collect and/or generate various sensory data over time for a wide range of fields and applications. Based on the nature of the application, these devices will result in big or fast/real-time data streams. Applying analytics over such data streams to discover new information, predict future insights, and make control decisions is a crucial process that makes IoT a worthy paradigm for businesses and a quality-of-life improving technology. In this paper, we provide a thorough overview on using a class of advanced machine learning techniques, namely Deep Learning (DL), to facilitate the analytics and learning in the IoT domain. We start by articulating IoT data characteristics and identifying two major treatments for IoT data from a machine learning perspective, namely IoT big data analytics and IoT streaming data analytics. We also discuss why DL is a promising approach to achieve the desired analytics in these types of data and applications. The potential of using emerging DL techniques for IoT data analytics is then discussed, and its promises and challenges are introduced. We present a comprehensive background on different DL architectures and algorithms. We also analyze and summarize major reported research attempts that leveraged DL in the IoT domain. The smart IoT devices that have incorporated DL in their intelligence background are also discussed. DL implementation approaches on the fog and cloud centers in support of IoT applications are also surveyed. Finally, we shed light on some challenges and potential directions for future research. At the end of each section, we highlight the lessons learned based on our experiments and review of the recent literature.
This paper gives an overview of automatic speaker recognition technology, with an emphasis on text-independent recognition. Speaker recognition has been studied actively for several decades. We give an overview of both the classical and the state-of-the-art methods. We start with the fundamentals of automatic speaker recognition, concerning feature extraction and speaker modeling. We elaborate advanced computational techniques to address robustness and session variability. The recent progress from vectors towards supervectors opens up a new area of exploration and represents a technology trend. We also provide an overview of this recent development and discuss the evaluation methodology of speaker recognition systems. We conclude the paper with discussion on future directions.
Voice transformation (VT) aims to change one or more aspects of a speech signal while preserving linguistic information. A subset of VT, Voice conversion (VC) specifically aims to change a source speaker's speech in such a way that the generated output is perceived as a sentence uttered by a target speaker. Despite many years of research, VC systems still exhibit deficiencies in accurately mimicking a target speaker spectrally and prosodically, and simultaneously maintaining high speech quality. In this work we provide an overview of real-world applications, extensively study existing systems proposed in the literature, and discuss remaining challenges.
In recent years, deep learning has been used extensively in a wide range of fields. In deep learning, Convolutional Neural Networks are found to give the most accurate results in solving real world problems. In this paper, we give a comprehensive summary of the applications of CNN in computer vision and natural language processing. We delineate how CNN is used in computer vision, mainly in face recognition, scene labelling, image classification, action recognition, human pose estimation and document analysis. Further, we describe how CNN is used in the field of speech recognition and text classification for natural language processing. We compare CNN with other methods to solve the same problem and explain why CNN is better than other methods.