Deep learning has shown revolutionary performance in many artificial intelligence applications, and its escalating computational demands call for hardware accelerators with massive parallelism and improved throughput. Optical neural networks (ONNs) are a promising candidate for next-generation neurocomputing due to their high parallelism, low latency, and low energy consumption. Here, we devise a hardware-efficient photonic subspace neural network (PSNN) architecture that targets lower optical-component usage, area cost, and energy consumption than previous ONN architectures with comparable task performance. A hardware-aware training framework is additionally provided to minimize the required device programming precision, reduce the chip area, and boost the noise robustness. We experimentally demonstrate our PSNN on a butterfly-style programmable silicon photonic integrated circuit and show its utility in a practical image-recognition task.
Recent years have seen tremendous growth in the field of artificial intelligence (AI), but some of the most pressing challenges to its continued development are the fundamental bandwidth, energy-efficiency, and speed limitations faced by electronic computing architectures. There has been growing interest in using photonic processors to perform neural network inference operations, but these networks are currently trained using standard digital electronics. Here, we propose on-chip training of neural networks enabled by a CMOS-compatible silicon photonic architecture, to harness the potential for massively parallel, efficient, and fast data operations. Our scheme employs the direct feedback alignment training algorithm, which trains neural networks using error feedback rather than error backpropagation, and can operate at speeds of trillions of multiply-accumulate (MAC) operations per second while consuming less than one picojoule per MAC operation. The photonic architecture exploits parallelized matrix-vector multiplications using arrays of microring resonators to process multi-channel analog signals along single waveguide buses, computing the gradient vector for each neural network layer in situ, which is the most computationally expensive operation performed during the backward pass. We also experimentally demonstrate training deep neural networks on the MNIST dataset using on-chip MAC operation results. Our approach to efficient, ultrafast neural network training showcases photonics as a promising platform for executing AI applications.
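To make the training scheme concrete, the following minimal NumPy sketch of direct feedback alignment is our illustration rather than the paper's code; the layer sizes, tanh nonlinearity, and learning rate are arbitrary. The key difference from backpropagation is that the hidden layer receives the output error through a fixed random matrix `B1` instead of the transposed forward weights `W2.T`.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy two-layer network: x -> h -> y_hat (sizes are illustrative).
n_in, n_hid, n_out = 784, 100, 10
W1 = rng.normal(0, 0.1, (n_hid, n_in))
W2 = rng.normal(0, 0.1, (n_out, n_hid))
B1 = rng.normal(0, 0.1, (n_hid, n_out))  # fixed random feedback matrix

def train_step(x, y, lr=0.01):
    global W1, W2
    # Forward pass.
    a1 = W1 @ x
    h = np.tanh(a1)
    y_hat = W2 @ h
    e = y_hat - y                    # output error (MSE gradient)
    # Direct feedback alignment: project e with B1, not with W2.T.
    dh = (B1 @ e) * (1.0 - h ** 2)   # tanh'(a1) = 1 - tanh(a1)^2
    # Local outer-product weight updates.
    W2 -= lr * np.outer(e, h)
    W1 -= lr * np.outer(dh, x)
```

On the photonic chip, `B1 @ e` is just one more matrix-vector product, so the same microring weight banks used for inference can compute each layer's error signal in situ during the backward pass.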
Deep neural networks (DNNs) are currently widely used for many artificial intelligence (AI) applications including computer vision, speech recognition, and robotics. While DNNs deliver state-of-the-art accuracy on many AI tasks, this accuracy comes at the cost of high computational complexity. Accordingly, techniques that enable efficient processing of DNNs to improve energy efficiency and throughput without sacrificing application accuracy or increasing hardware cost are critical to the wide deployment of DNNs in AI systems. This article aims to provide a comprehensive tutorial and survey of the recent advances towards the goal of enabling efficient processing of DNNs. Specifically, it will provide an overview of DNNs, discuss various hardware platforms and architectures that support DNNs, and highlight key trends in reducing the computation cost of DNNs either solely via hardware design changes or via joint hardware design and DNN algorithm changes. It will also summarize various development resources that enable researchers and practitioners to quickly get started in this field, and highlight important benchmarking metrics and design considerations that should be used for evaluating the rapidly growing number of DNN hardware designs, optionally including algorithmic co-designs, being proposed in academia and industry. The reader will take away the following concepts from this article: understand the key design considerations for DNNs; be able to evaluate different DNN hardware implementations with benchmarks and comparison metrics; understand the trade-offs between various hardware architectures and platforms; be able to evaluate the utility of various DNN design techniques for efficient processing; and understand recent implementation trends and opportunities.
Traditional computing systems based on the von Neumann architecture are fundamentally bottlenecked by data-intensive workloads and applications such as machine learning. As data-movement operations and energy consumption become key bottlenecks in the design of computing systems, interest in unconventional approaches such as near-data processing (NDP) and accelerators for machine learning, in particular neural networks (NNs), has grown significantly. Emerging memory technologies such as ReRAM and 3D-stacked memory are promising for efficiently architecting NDP-based accelerators for NNs, owing to their capability to work as both high-density/low-energy storage and near-memory computation/search engines. In this paper, we present a survey of techniques for designing NDP architectures for NNs. By classifying the techniques based on the memory technology they employ, we underscore their similarities and differences. Finally, we discuss open challenges and future perspectives that need to be explored in order to improve and extend NDP architectures for future computing platforms. This paper will be valuable for computer architects, chip designers, and researchers in the field of machine learning.
As deep neural networks (DNNs) grow to solve increasingly complex problems, they are becoming limited by the latency and power consumption of existing digital processors. For improved speed and energy efficiency, specialized analog optical and electronic hardware has been proposed, but with limited scalability (input vector lengths $K$ of hundreds of elements). Here, we present a scalable, single-layer analog optical processor that uses free-space optics to reconfigurably distribute an input vector, and integrated optoelectronics for static, updatable weighting and the nonlinearity, with $K \approx 1{,}000$ and beyond. We experimentally test classification accuracy on the MNIST handwritten-digit dataset, achieving 94.7% (ground truth 96.3%) without data preprocessing or retraining on the hardware. We also determine the fundamental upper bound on throughput ($\sim$0.9 exaMAC/s), set by the maximum optical bandwidth before error increases substantially. Our combination of wide spectral and spatial bandwidths in a CMOS-compatible system enables highly efficient computing for next-generation DNNs.
Decision-making by artificial neural networks with minimal latency is paramount for numerous applications such as navigation, tracking, and real-time machine-action systems. This requires machine-learning hardware to handle multidimensional data at high throughput. Unfortunately, processing convolution operations, the major computational tool for data classification tasks, obeys challenging run-time complexity scaling laws. However, inherently implementing the convolution theorem in a Fourier-optic display-light processor enables a non-iterative O(1) run-time complexity for data inputs beyond 1,000 x 1,000 large matrices. Following this approach, here we demonstrate data-streaming multi-kernel image batch processing with a Fourier convolutional neural network (FCNN) accelerator. We show image batch processing of large-scale matrices as passive 20-million dot-product multiplications performed by digital light processing modules in the Fourier domain. In addition, we parallelize this optical FCNN system further by exploiting multiple spatiotemporal diffraction orders, thereby achieving a 98-fold throughput improvement over state-of-the-art FCNN accelerators. A comprehensive discussion of the practical challenges related to working at the edge of the system's capabilities highlights the problems of crosstalk in the Fourier domain and resolution scaling laws. Accelerating convolutions by utilizing the massive parallelism of the demonstrated technology points beyond von Neumann-based machine-learning acceleration.
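The O(1) claim rests on the convolution theorem: convolution in the spatial domain becomes element-wise multiplication in the Fourier domain, and a Fourier-optical processor performs the transforms passively with lenses. A NumPy sketch of the digital equivalent (our illustration; sizes are arbitrary):

```python
import numpy as np

def fft_conv2d(image, kernel):
    """Circular 2-D convolution via the convolution theorem."""
    H, W = image.shape
    K = np.fft.fft2(kernel, s=(H, W))    # kernel spectrum, zero-padded
    I = np.fft.fft2(image)               # image spectrum
    return np.real(np.fft.ifft2(I * K))  # inverse transform of the product

image = np.random.rand(1000, 1000)
kernel = np.random.rand(3, 3)
out = fft_conv2d(image, kernel)  # O(HW log HW) digitally; time of flight optically
```

Digitally, the two forward transforms and one inverse transform dominate the cost; optically they come for free, leaving only the Fourier-plane modulation and the camera readout.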
Spiking neural networks (SNNs) offer a new computational paradigm capable of highly parallel, real-time processing. Photonic devices are ideal for the design of high-bandwidth, parallel architectures that match the SNN computational paradigm. Co-integration of CMOS and photonic elements allows low-loss photonic devices to be combined with analog electronics, for greater flexibility in nonlinear computational elements. As such, we designed and simulated an optoelectronic spiking neuron circuit on a monolithic silicon photonics (SiPh) process that replicates useful spiking behaviors beyond the leaky integrate-and-fire (LIF) model. Additionally, we explored two learning algorithms with potential for on-chip learning, using Mach-Zehnder interferometer (MZI) meshes as synaptic interconnects. A variant of random backpropagation (RPB) was experimentally demonstrated and matched the performance of standard linear regression on a simple classification task. Meanwhile, the contrastive Hebbian learning (CHL) rule was applied to a simulated neural network composed of MZI meshes for a random input-output mapping task. The CHL-trained MZI network performed better than random guessing but did not match the performance of an ideal neural network (without the constraints imposed by the MZI meshes). Through these efforts, we demonstrate that co-integrated CMOS and SiPh technologies are well-suited to the design of scalable SNN computing architectures.
Always-on TinyML perception tasks in IoT applications require very high energy efficiency. Analog compute-in-memory (CIM) using non-volatile memory (NVM) promises high efficiency and also provides self-contained on-chip model storage. However, analog CIM introduces new practical considerations, including conductance drift, read/write noise, fixed analog-to-digital converter gain, etc. These additional constraints must be addressed to achieve models that can be deployed on analog CIM with acceptable accuracy loss. This work describes AnalogNets: TinyML models for the popular always-on tasks of keyword spotting (KWS) and visual wake words (VWW). The model architectures are specifically designed for analog CIM, and we detail a comprehensive training methodology to retain accuracy in the face of analog non-idealities and low-precision data converters at inference time. We also describe AON-CIM, a programmable, minimal-area phase-change memory (PCM) analog CIM accelerator, with a novel layer-serial approach to remove the cost of the complex interconnects associated with a fully pipelined design. We evaluate AnalogNets on a calibrated simulator as well as on real hardware, and find that accuracy degradation is limited to 0.8%/1.2% for KWS/VWW after 24 hours of PCM drift (8-bit). Running on the 14 nm AON-CIM accelerator, AnalogNets reach 57.39/25.69 TOPS/W for KWS/VWW when moving from 8-bit to 4-bit activations.
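The abstract does not spell out the training methodology, but a standard ingredient of hardware-aware training for analog CIM (a hedged sketch of the general technique, not the AnalogNets recipe) is to inject multiplicative weight noise in the forward pass so the optimizer finds weights that tolerate conductance drift and read noise:

```python
import numpy as np

def noisy_matvec(W, x, sigma=0.05, rng=None):
    """Forward matvec with multiplicative Gaussian weight noise,
    mimicking NVM conductance variation at training time."""
    if rng is None:
        rng = np.random.default_rng()
    W_noisy = W * (1.0 + rng.normal(0.0, sigma, W.shape))
    return W_noisy @ x

# Each training step samples a fresh perturbation, pushing the model
# toward minima that stay accurate when the programmed conductances drift.
```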
In this paper, a new methodology is proposed that allows for the low-complexity development of neural network (NN) based equalizers for the mitigation of impairments in high-speed coherent optical transmission systems. In this work, we provide a comprehensive description and comparison of various deep model compression approaches that have been applied to feed-forward and recurrent NN designs, and we evaluate the influence of these strategies on the performance of each NN equalizer. Quantization, weight clustering, pruning, and other cutting-edge strategies for model compression are considered. We also propose and evaluate a Bayesian-optimization-assisted compression, in which the hyperparameters of the compression are chosen to simultaneously reduce complexity and improve performance. The trade-off between the complexity of each compression approach and its performance is evaluated using both simulated and experimental data. By utilizing the optimal compression approach, we show that it is possible to design an NN-based equalizer that is simpler to implement and has better performance than the conventional digital back-propagation (DBP) equalizer with only one step per span. This is accomplished by reducing the number of multipliers used in the NN equalizer after applying the weight-clustering and pruning algorithms. Furthermore, we demonstrate that an NN-based equalizer can also achieve superior performance while still maintaining the same degree of complexity as the full electronic chromatic dispersion compensation block. We conclude our analysis by highlighting open questions and existing challenges, as well as future research directions.
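To picture where the multiplier savings come from, here is a minimal sketch of magnitude pruning followed by weight clustering (our illustration using scikit-learn k-means; the sparsity level and codebook size are arbitrary, and the paper's exact procedure may differ):

```python
import numpy as np
from sklearn.cluster import KMeans

def prune_and_cluster(W, sparsity=0.5, n_clusters=16):
    """Zero out small weights, then snap the survivors to a small codebook."""
    W = W.copy()
    thresh = np.quantile(np.abs(W), sparsity)
    W[np.abs(W) < thresh] = 0.0                      # magnitude pruning
    nz = W != 0
    km = KMeans(n_clusters=n_clusters, n_init=10)
    km.fit(W[nz].reshape(-1, 1))
    W[nz] = km.cluster_centers_[km.labels_].ravel()  # weight clustering
    return W
```

Pruned weights remove multiplications outright, and clustered weights allow products to be shared: inputs mapped to the same codebook entry can be summed first and multiplied by the shared centroid once.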
In this work, we demonstrate the offline FPGA realization of both recurrent and feedforward neural network (NN)-based equalizers for nonlinearity compensation in coherent optical transmission systems. First, we present a realization pipeline showing the conversion of the models from Python libraries to the FPGA chip synthesis and implementation. Then, we review the main alternatives for the hardware implementation of nonlinear activation functions. The main results are divided into three parts: a performance comparison, an analysis of how activation functions are implemented, and a report on the complexity of the hardware. The performance in Q-factor is presented for a bidirectional long short-term memory coupled with convolutional NN (biLSTM+CNN) equalizer, a CNN equalizer, and standard 1-StpS digital back-propagation (DBP), for the simulated and experimental propagation of a single-channel dual-polarization (SC-DP) 16QAM signal at 34 GBd along 17x70 km of LEAF. The biLSTM+CNN equalizer provides a result similar to DBP and a 1.7 dB Q-factor gain over the chromatic dispersion compensation baseline on the experimental dataset. After that, we assess the Q-factor and the impact on hardware utilization when approximating the activation functions of the NN using Taylor series, piecewise linear, and look-up table (LUT) approximations. We also show how to mitigate the approximation errors with extra training and provide some insights into possible gradient problems in the LUT approximation. Finally, to evaluate the complexity of a hardware implementation achieving 400G throughput, fixed-point NN-based equalizers with approximated activation functions are developed and implemented in an FPGA.
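As a sketch of the LUT option (ours; the table size, input range, and nearest-entry indexing are illustrative choices, not the paper's exact configuration), a tanh activation becomes a precomputed table lookup:

```python
import numpy as np

LUT_MIN, LUT_MAX, LUT_SIZE = -4.0, 4.0, 256
_TABLE = np.tanh(np.linspace(LUT_MIN, LUT_MAX, LUT_SIZE))

def tanh_lut(x):
    """Nearest-entry LUT approximation of tanh; saturates outside the range."""
    step = (LUT_MAX - LUT_MIN) / (LUT_SIZE - 1)
    idx = np.clip(np.round((x - LUT_MIN) / step), 0, LUT_SIZE - 1).astype(int)
    return _TABLE[idx]

x = np.linspace(-6.0, 6.0, 1001)
print(np.abs(tanh_lut(x) - np.tanh(x)).max())  # worst-case approximation error
```

The resulting piecewise-constant function has zero derivative almost everywhere, which hints at the gradient problems the authors discuss when extra training is run through the LUT.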
Over the past two decades, epileptic seizure detection and prediction algorithms have evolved rapidly. However, despite significant performance improvements, their hardware implementation using conventional technologies, such as complementary metal-oxide-semiconductor (CMOS), remains a challenging task in power- and area-constrained settings, particularly when many recording channels are used. In this paper, we propose a novel low-latency parallel convolutional neural network (CNN) architecture that has 2x to 2,800x fewer network parameters compared to state-of-the-art (SOTA) CNN architectures and achieves 5-fold cross-validation accuracies of 99.84% for epileptic seizure detection, and 99.01% and 97.54% for epileptic seizure prediction, when evaluated using the University of Bonn electroencephalogram (EEG), CHB-MIT, and SWEC-ETHZ seizure datasets, respectively. We subsequently implement our network onto analog crossbar arrays comprising resistive random-access memory (RRAM) devices, and provide a comprehensive benchmark by simulating, laying out, and determining the hardware requirements of the CNN components in the system. To the best of our knowledge, we are the first to parallelize the execution of convolution-layer kernels on separate analog crossbars, achieving a 2-order-of-magnitude latency reduction compared to SOTA hybrid memristive-CMOS DL accelerators. Furthermore, we investigate the effects of non-idealities on our system, and study quantization-aware training (QAT) to mitigate the performance degradation due to low ADC/DAC resolution. Finally, we propose a stuck-weight offsetting methodology to mitigate the performance degradation caused by stuck Ron/Roff memristor weights, recovering up to 32% accuracy without requiring retraining. The CNN components of our platform are estimated to occupy an area of about 31.255 mm$^2$ and consume approximately 2.791 W in a 22 nm FDSOI CMOS process.
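Quantization-aware training, used here against the low ADC/DAC resolution, typically simulates the quantizer in the forward pass and passes gradients straight through it. A minimal sketch of the fake-quantization step (ours; symmetric uniform quantization with a per-tensor scale is one common choice):

```python
import numpy as np

def fake_quantize(w, n_bits=4):
    """Round weights to an n_bit uniform grid but keep them in float."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = np.max(np.abs(w)) / qmax
    return np.round(w / scale) * scale

# Straight-through estimator: the forward pass uses fake_quantize(w),
# while the backward pass treats the rounding as identity, so the
# full-precision shadow weights keep accumulating small updates.
```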
While machine learning is traditionally a resource-intensive task, embedded systems, autonomous navigation, and the vision of the Internet of Things fuel the interest in resource-efficient approaches. These approaches aim for a carefully chosen trade-off between performance and resource consumption in terms of computation and energy. The development of such approaches is among the major challenges in current machine learning research and key to ensuring a smooth transition of machine learning technology from a scientific environment with virtually unlimited computing resources into everyday applications. In this article, we provide an overview of the current state of the art of machine learning techniques facilitating these real-world requirements. In particular, we focus on deep neural networks (DNNs), the predominant machine learning models of the past decade. We give a comprehensive overview of the vast literature, which can be mainly split into three non-mutually exclusive categories: (i) quantized neural networks, (ii) network pruning, and (iii) structural efficiency. These techniques can be applied during training or as post-processing, and they are widely used to reduce computational demands in terms of memory footprint, inference speed, and energy efficiency. We also briefly discuss different concepts of embedded hardware for DNNs and their compatibility with machine learning techniques, as well as their potential for energy and latency reduction. We substantiate our discussion with experiments on well-known benchmark datasets using compression techniques (quantization, pruning) for a set of resource-constrained embedded systems, such as CPUs, GPUs and FPGAs. The obtained results highlight the difficulty of finding good trade-offs between resource efficiency and predictive performance.
Integrated photonic neural networks (IPNNs) are emerging as promising successors to conventional electronic AI accelerators, as they offer substantial improvements in computing speed and energy efficiency. In particular, coherent IPNNs use arrays of Mach-Zehnder interferometers (MZIs) for unitary transformations to perform energy-efficient matrix-vector multiplication. However, the fundamental MZI devices in IPNNs are susceptible to uncertainties stemming from optical lithographic variations and thermal crosstalk, and can suffer from imprecisions due to non-uniform MZI insertion loss and quantization errors arising from low-precision encoding of the tuned phase angles. In this paper, we systematically characterize, for the first time, the impact of such uncertainties and imprecisions (collectively referred to as imperfections) in IPNNs using a bottom-up approach. We show that their impact on IPNN accuracy can vary widely depending on the tuning parameters (e.g., phase angles) of the affected components, their physical location, and the nature and distribution of the imperfections. To improve reliability measures, we identify critical IPNN building blocks that, under imperfections, can lead to catastrophic degradation of the classification accuracy. We show that under multiple simultaneous imperfections, the IPNN inference accuracy can degrade by up to 46%, even when the imperfection parameters are restricted to a small range. Our results also indicate that the inference accuracy is sensitive to imperfections affecting the MZIs in the linear layers next to the IPNN input layer.
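To make the phase-angle sensitivity concrete, recall the transfer matrix of the 2x2 MZI building block in one common convention (a textbook identity, not taken from this paper; $\theta$ is the internal and $\phi$ the external phase shift):

$$
U_{\text{MZI}}(\theta,\phi) \;=\; i e^{i\theta/2}
\begin{pmatrix}
e^{i\phi}\sin(\theta/2) & \cos(\theta/2)\\
e^{i\phi}\cos(\theta/2) & -\sin(\theta/2)
\end{pmatrix}.
$$

An $N \times N$ coherent layer is realized as a mesh of $N(N-1)/2$ such blocks, so a small error $\delta\theta$ in one device perturbs its power-splitting ratio through $\sin(\theta/2)$ and $\cos(\theta/2)$ and is then multiplied through every downstream block, which is consistent with the paper's finding that MZIs near the input of the linear layers are the most critical.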
Deploying modern TinyML tasks on small, battery-constrained IoT devices requires high computational energy efficiency. Analog in-memory computing (IMC) using non-volatile memory (NVM) promises major efficiency improvements in deep neural network (DNN) inference and serves as on-chip memory storage for DNN weights. However, IMC's functional flexibility limitations, and their impact on performance, energy, and area efficiency, are not yet fully understood at the system level. To target practical end-to-end IoT applications, IMC arrays must be enclosed in heterogeneous programmable systems, introducing new system-level challenges which we aim to address in this work. We present a heterogeneous, tightly-coupled clustered architecture integrating 8 RISC-V cores, an in-memory computing accelerator (IMA), and digital accelerators. We benchmark the system on a highly heterogeneous workload such as the bottleneck layer from MobileNetV2, showing 11.5x performance and 9.5x energy-efficiency improvements compared to highly optimized parallel execution on the cores. Furthermore, we explore the requirements for end-to-end inference of a full mobile-grade DNN (MobileNetV2) in terms of IMC array resources, by scaling up our heterogeneous architecture to a multi-array accelerator. Our results show that, on the end-to-end inference of MobileNetV2, our solution outperforms existing programmable architectures in execution latency and is an order of magnitude better than state-of-the-art heterogeneous solutions integrating analog in-memory-computing cores.
Machine learning methods have revolutionized the discovery process of new molecules and materials. However, the intensive training process of neural networks for molecules with ever-increasing complexity has resulted in exponential growth in computation cost, leading to long simulation times and high energy consumption. Photonic chip technology offers an alternative platform for implementing neural networks with faster data processing and lower energy usage compared to digital computers. Photonics technology is naturally capable of implementing complex-valued neural networks at no additional hardware cost. Here, we demonstrate the capability of photonic neural networks for predicting the quantum mechanical properties of molecules. To the best of our knowledge, this work is the first to harness photonic technology for machine learning applications in computational chemistry and molecular sciences, such as drug discovery and materials design. We further show that multiple properties can be learned simultaneously in a photonic chip via a multi-task regression learning algorithm, itself also a first of its kind, as most previous works focus on implementing networks for classification tasks.
The last few years have seen a lot of work to address the challenge of low-latency and high-throughput convolutional neural network inference. Integrated photonics has the potential to dramatically accelerate neural networks because of its low-latency nature. Combined with the concept of the Joint Transform Correlator (JTC), the computationally expensive convolution functions can be computed instantaneously (at the time of flight of light) at almost no cost. This 'free' convolution computation provides the theoretical basis of the proposed PhotoFourier JTC-based CNN accelerator. PhotoFourier addresses a myriad of challenges posed by on-chip photonic computing in the Fourier domain, including 1D lenses and high-cost optoelectronic conversions. The proposed PhotoFourier accelerator achieves a more than 28X better energy-delay product compared to state-of-the-art photonic neural network accelerators.
Deep learning is pervasive in our daily lives, powering self-driving cars, virtual assistants, social network services, healthcare services, face recognition, and more. However, deep neural networks demand substantial compute resources during training and inference. The machine learning community has mainly focused on model-level optimizations (such as architectural compression of deep learning models), while the systems community has focused on implementation-level optimizations. In between, various arithmetic-level optimization techniques have been proposed in the arithmetic community. This article provides a survey on resource-efficient deep learning techniques in terms of model-, arithmetic-, and implementation-level techniques, and identifies the research gaps for resource-efficient deep learning across these three levels. Based on our definition of a resource-efficiency metric, our survey clarifies the influence of the lower-level techniques and explores the future trends of resource-efficient deep learning research.
The increasing popularity of deep-learning-powered applications raises the issue of the vulnerability of neural networks to adversarial attacks: hardly perceptible changes in the input data can lead to output errors in a neural network, hindering its utilization in applications that involve decisions with security risks. A number of previous works have already thoroughly evaluated the most commonly used configuration, Convolutional Neural Networks (CNNs), against different types of adversarial attacks. Moreover, recent works have demonstrated the transferability of some adversarial examples across different neural network models. This paper studies the robustness of new emerging models, such as SpinalNet-based neural networks and Compact Convolutional Transformers (CCT), on the image classification problem of the CIFAR-10 dataset. Each architecture was tested against four white-box attacks and three black-box attacks. Unlike the VGG and SpinalNet models, the attention-based CCT configuration demonstrated a large span between strong robustness and vulnerability to adversarial examples. Finally, a study of transferability between the VGG, the VGG-inspired SpinalNet, and the pretrained CCT 7/3x1 models was conducted. It was shown that despite the high effectiveness of an attack on a certain individual model, this does not guarantee transferability to other models.
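Among the white-box attacks commonly used in such studies is the fast gradient sign method (FGSM); the sketch below (ours, on a toy linear softmax classifier rather than the paper's VGG/SpinalNet/CCT models) shows the one-step perturbation:

```python
import numpy as np

def fgsm(x, y, W, eps=0.03):
    """One-step FGSM attack on a toy softmax classifier with logits W @ x."""
    logits = W @ x
    p = np.exp(logits - logits.max())
    p /= p.sum()
    onehot = np.eye(W.shape[0])[y]
    grad_x = W.T @ (p - onehot)          # d(cross-entropy)/dx
    # Perturb each pixel by eps in the direction that increases the loss.
    return np.clip(x + eps * np.sign(grad_x), 0.0, 1.0)

rng = np.random.default_rng(1)
W = rng.normal(0, 0.1, (10, 3072))       # CIFAR-10-sized toy linear model
x = rng.random(3072)                     # a flattened 32x32x3 "image"
x_adv = fgsm(x, y=3, W=W)
```

Black-box attacks lack this gradient access, which is why transferability matters: an example crafted from one model's gradients may or may not fool another.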
Energy harvesting (EH) IoT devices that operate intermittently, combined with advances in deep neural networks (DNNs), have opened up new opportunities for enabling sustainable smart applications. However, implementing such computation- and memory-intensive intelligent algorithms on EH devices is extremely difficult due to the challenges of limited resources and an intermittent power supply that causes frequent failures. To address these challenges, this paper proposes a methodology that enables super-fast deep learning with low-energy accelerators for tiny energy harvesting devices. We first propose RAD, a resource-aware structured DNN training framework that employs block-circulant matrices with ADMM to achieve high compression, together with model quantization, to leverage the advantages of various vector-operation accelerators. We then propose ACE, a DNN implementation method that employs low-energy accelerators to achieve maximum performance with small energy consumption. Finally, we further design Flex, a system support for intermittent computation in energy harvesting situations. Experimental results on three different DNN models demonstrate that RAD, ACE, and Flex enable super-fast and correct inference on energy harvesting devices, with up to 4.26x runtime reduction and up to 7.7x energy reduction, at higher accuracy than the state of the art.
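The block-circulant compression in RAD stores each weight block as one defining vector and computes the block's matvec with FFTs; a sketch of a single circulant block (ours; the block size is illustrative and the ADMM procedure that enforces this structure during training is omitted):

```python
import numpy as np

def circulant_matvec(c, x):
    """Multiply by the circulant matrix whose first column is c, in O(n log n)."""
    return np.real(np.fft.ifft(np.fft.fft(c) * np.fft.fft(x)))

c = np.array([1.0, 2.0, 3.0, 4.0])  # 4 parameters stand in for a 4x4 block
x = np.array([1.0, 0.0, 0.0, 0.0])
print(circulant_matvec(c, x))       # recovers the first column: [1. 2. 3. 4.]
```

A b x b circulant block needs b parameters instead of b^2 and O(b log b) work instead of O(b^2), which is where the compression and runtime reduction come from.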
Deep learning is highly pervasive in today's data-intensive era. In particular, convolutional neural networks (CNNs) are being widely adopted in a variety of fields for superior accuracy. However, computing deep CNNs on traditional CPUs and GPUs brings several performance and energy pitfalls. Several novel approaches based on ASICs, FPGAs, and resistive-memory devices have been demonstrated recently with promising results. Most of them target only the inference (testing) phase of deep learning. There have been very limited attempts to design a full-fledged deep learning accelerator capable of both training and inference, owing to the highly compute- and memory-intensive nature of the training phase. In this paper, we propose LiteCon, a novel analog photonic CNN accelerator. LiteCon uses silicon microdisk-based convolution, memristor-based memory, and dense wavelength-division multiplexing for energy-efficient and ultrafast deep learning. We evaluate LiteCon using a commercial CAD framework (IPKISS) on deep learning benchmark models including LeNet and VGG-Net. Compared to the state of the art, LiteCon improves the CNN throughput, energy efficiency, and computational efficiency by up to 32x, 37x, and 5x, respectively, with trivial accuracy degradation.