Much of the existing linguistic data in many of the world's languages is locked away in non-digitized books and documents. Optical character recognition (OCR) can be used to produce digitized text, and previous work has demonstrated the utility of neural post-correction methods that improve the results of general-purpose OCR systems on recognition of less-well-resourced languages. However, these methods rely on manually curated post-correction data, which are relatively scarce compared to the non-annotated raw images that need to be digitized. In this paper, we present a semi-supervised learning method that makes it possible to utilize these raw images to improve performance, specifically through the use of self-training, a technique where a model is iteratively trained on its own outputs. In addition, to enforce consistency in the recognized vocabulary, we introduce a lexically aware decoding method that augments the neural post-correction model with a count-based language model constructed from the recognized texts, implemented using weighted finite-state automata (WFSA) for efficient and effective decoding. Results on four endangered languages demonstrate the utility of the proposed method, with relative error reductions of 15-29%, where we find the combination of self-training and lexically aware decoding to be essential for achieving consistent improvements. Data and code are available at https://shrutirij.github.io/ocr-el/.
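To make the self-training component concrete, here is a minimal sketch of the iterative loop described above. The helpers `train`, `post_correct`, and `confidence` are hypothetical placeholders for the model's training, decoding, and scoring routines, not the paper's released code.

```python
# Minimal sketch of self-training for OCR post-correction; `train`,
# `post_correct`, and `confidence` are hypothetical placeholders.
def self_train(train, post_correct, confidence, model,
               labeled_pairs, raw_ocr_lines, rounds=3, threshold=0.9):
    """Iteratively retrain a post-correction model on its own outputs."""
    data = list(labeled_pairs)                       # (ocr_text, gold) pairs
    for _ in range(rounds):
        model = train(model, data)
        pseudo = []
        for line in raw_ocr_lines:
            hyp = post_correct(model, line)          # model's own correction
            if confidence(model, line, hyp) >= threshold:
                pseudo.append((line, hyp))           # keep confident outputs only
        data = list(labeled_pairs) + pseudo          # gold data is always kept
    return model
```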
Deploying machine learning models in production may allow adversaries to infer sensitive information about training data. There is a vast literature analyzing different types of inference risks, ranging from membership inference to reconstruction attacks. Inspired by the success of games (i.e., probabilistic experiments) to study security properties in cryptography, some authors describe privacy inference risks in machine learning using a similar game-based style. However, adversary capabilities and goals are often stated in subtly different ways from one presentation to the other, which makes it hard to relate and compose results. In this paper, we present a game-based framework to systematize the body of knowledge on privacy inference risks in machine learning.
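For readers unfamiliar with the game-based style, the sketch below writes the classic membership inference game as a probabilistic experiment. The dataset sampler, trainer, and adversary are abstract placeholders; the paper's actual games differ in their details.

```python
import random

# Illustrative membership-inference game in the game-based style the paper
# systematizes; `sample_dataset`, `train`, and `adversary` are placeholders.
def membership_inference_game(sample_dataset, train, adversary, n=1000):
    wins = 0
    for _ in range(n):
        data = sample_dataset()             # challenger draws a training set
        member = random.choice(data)        # candidate record: a member...
        non_member = sample_dataset()[0]    # ...or a fresh non-member
        b = random.randint(0, 1)            # secret bit chosen by challenger
        challenge = member if b == 1 else non_member
        model = train(data)                 # model trained on the sampled data
        guess = adversary(model, challenge) # adversary infers membership
        wins += int(guess == b)
    return wins / n - 0.5                   # advantage over random guessing
```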
Sparsely gated Mixture of Experts (MoE) models have been shown to be a compute-efficient method to scale model capacity for multilingual machine translation. However, for low-resource tasks, MoE models severely over-fit. We show effective regularization strategies, namely dropout techniques for MoE layers (EOM and FOM), Conditional MoE Routing, and Curriculum Learning methods, that prevent over-fitting and improve the performance of MoE models on low-resource tasks without adversely affecting high-resource tasks. On a massively multilingual machine translation benchmark, our strategies result in about +1 chrF++ improvement in very low resource language pairs. We perform an extensive analysis of the learned MoE routing to better understand the impact of our regularization methods and how we can improve them.
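As one plausible reading of a dropout-style regularizer for MoE layers, here is a PyTorch sketch in the spirit of Expert Output Masking (EOM): the combined expert output is zeroed for a random fraction of tokens during training. The shape conventions and inverted-dropout rescaling are assumptions, not the paper's implementation.

```python
import torch

def expert_output_masking(expert_outputs, p=0.2, training=True):
    """Sketch of an EOM-style regularizer: randomly zero the combined MoE
    output for a fraction of tokens during training.

    expert_outputs: (tokens, dim) combined output of the routed experts.
    """
    if not training or p == 0.0:
        return expert_outputs
    keep = (torch.rand(expert_outputs.size(0), 1) >= p).to(expert_outputs.dtype)
    return expert_outputs * keep / (1.0 - p)   # inverted-dropout rescaling
```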
Multilingual machine translation models can benefit from synergy between different language pairs, but also suffer from interference. While there is a growing number of sophisticated methods that aim to eliminate interference, our understanding of interference as a phenomenon is still limited. This work identifies the main factors that contribute to interference in multilingual machine translation. Through systematic experimentation, we find that interference (or synergy) is primarily determined by model size, data size, and the proportion of each language pair within the total dataset. We observe that substantial interference occurs mainly when the model is very small with respect to the available training data, and that using standard transformer configurations with fewer than one billion parameters largely alleviates interference and promotes synergy. Moreover, we show that tuning the sampling temperature to control the proportion of each language pair in the data is key to balancing the amount of interference between low- and high-resource language pairs effectively, and can lead to superior performance overall.
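The sampling-temperature knob mentioned above is the standard scheme in which a pair with data share n_i/N is sampled with probability proportional to (n_i/N)^(1/T); higher T flattens the distribution, upweighting low-resource pairs. The sketch below illustrates it with made-up pair sizes.

```python
# Temperature-based sampling over language pairs; pair sizes are examples.
def sampling_probs(pair_sizes, temperature=5.0):
    """p_i proportional to (n_i / N) ** (1 / T)."""
    total = sum(pair_sizes.values())
    weights = {k: (n / total) ** (1.0 / temperature)
               for k, n in pair_sizes.items()}
    z = sum(weights.values())
    return {k: w / z for k, w in weights.items()}

print(sampling_probs({"en-fr": 40_000_000, "en-gu": 150_000}, temperature=5.0))
```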
Time series forecasting has been an active area of research due to its many applications, ranging from network usage prediction and resource allocation to anomaly detection and predictive maintenance. Numerous publications in the last five years have proposed diverse sets of objective loss functions to address cases such as biased data, long-term forecasting, multicollinear features, etc. In this paper, we summarize 14 well-known regression loss functions commonly used for time series forecasting and list the circumstances where their application can aid in faster and better model convergence. We also demonstrate how certain categories of loss functions perform well across all data sets and can be considered as a baseline objective function in circumstances where the distribution of the data is unknown. Our code is available at GitHub: https://github.com/aryan-jadon/Regression-Loss-Functions-in-Time-Series-Forecasting-Tensorflow.
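For a flavor of the surveyed losses, here are two standard ones written from their textbook definitions in TensorFlow (the framework of the linked repository); they are not copied from that repository.

```python
import tensorflow as tf

def huber_loss(y_true, y_pred, delta=1.0):
    """Quadratic near zero, linear in the tails; robust to outliers."""
    err = tf.abs(y_true - y_pred)
    quad = tf.minimum(err, delta)
    return tf.reduce_mean(0.5 * quad ** 2 + delta * (err - quad))

def quantile_loss(y_true, y_pred, q=0.9):
    """Pinball loss: asymmetric penalty targeting the q-th quantile."""
    err = y_true - y_pred
    return tf.reduce_mean(tf.maximum(q * err, (q - 1) * err))
```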
Membership inference (MI) attacks highlight a privacy weakness in current stochastic training methods for neural networks. However, it is not well understood why they arise. Are they merely a natural consequence of imperfect generalization? Which underlying causes should we address during training to mitigate these attacks? Towards answering such questions, we propose the first approach to explain MI attacks and their connection to generalization based on principled causal reasoning. We offer causal graphs that quantitatively explain the observed MI attack performance achieved for 6 attack variants. We refute several prior non-quantitative hypotheses that over-simplify or over-estimate the influence of underlying causes, thereby failing to capture the complex interplay between several factors. Our causal models also show a new connection between generalization and MI attacks via their shared causal factors. Our causal models have high predictive power (0.90), i.e., their analytical predictions often match observations in unseen experiments, which makes analysis through them a pragmatic alternative.
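As background on what an "attack variant" looks like, here is a sketch of the simplest one from the MI literature, a loss-threshold attack; whether it is among the six variants the paper analyzes is not stated here, and `loss_fn` and the threshold are illustrative placeholders.

```python
# Sketch of a loss-threshold membership inference attack: predict "member"
# when the model's loss on a point is suspiciously low. `loss_fn` is a
# hypothetical placeholder for the model's per-example loss.
def loss_threshold_attack(loss_fn, model, point, label, threshold=0.5):
    """Return 1 (member) if the example's loss falls below the threshold."""
    return int(loss_fn(model, point, label) < threshold)
```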
A large body of work has shown that machine learning (ML) models can leak sensitive or confidential information about their training data. Recently, leakage due to distribution inference (or property inference) attacks has been gaining attention. In this attack, the goal of the adversary is to infer distributional information about the training data. So far, research on distribution inference has focused on demonstrating successful attacks, with little attention given to identifying the underlying causes of the leakage or to proposing mitigations. To bridge this gap, as our main contribution, we theoretically and empirically analyze the sources of information leakage that allow an adversary to perform distribution inference attacks. We identify three sources of leakage: (1) memorizing specific information about $\mathbb{E}[Y|X]$ (the expected label given the feature values), (2) the wrong inductive bias of the model, and (3) the finiteness of the training data. Next, based on our analysis, we propose principled mitigation techniques against distribution inference attacks. Specifically, we demonstrate that causal learning techniques are more resilient than associative learning methods to a particular type of distribution inference, so-called distributional membership inference. Finally, we present a formalization of distribution inference that allows reasoning about more general adversaries than previously possible.
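By analogy with the membership game sketched earlier, a distribution inference game can be written as follows; the distribution samplers, trainer, and adversary are hypothetical placeholders rather than the paper's formalization verbatim.

```python
import random

# Illustrative distribution-inference game: the adversary must tell which
# of two training distributions the model was trained on.
def distribution_inference_game(sample_D0, sample_D1, train, adversary, n=1000):
    wins = 0
    for _ in range(n):
        b = random.randint(0, 1)              # secret bit: which distribution
        data = sample_D1() if b else sample_D0()
        model = train(data)                   # model trained on the drawn set
        guess = adversary(model)              # adversary names the distribution
        wins += int(guess == b)
    return wins / n - 0.5                     # advantage over random guessing
```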
Neuromorphic computers perform computations by emulating the human brain and consume extremely little power. They are expected to be indispensable for energy-efficient computing in the future. While they are primarily used in spiking neural network-based machine learning applications, neuromorphic computers are known to be Turing-complete and thus capable of general-purpose computation. However, to fully realize their potential for general-purpose, energy-efficient computing, it is important to devise efficient mechanisms for encoding numbers. Current encoding approaches have limited applicability and may not be suitable for general-purpose computation. In this paper, we present the virtual neuron as an encoding mechanism for integers and rational numbers. We evaluate the performance of the virtual neuron on physical and simulated neuromorphic hardware and show that it can perform an addition operation using, on average, 23 nJ of energy on a mixed-signal, memristor-based neuromorphic processor. We also demonstrate its utility by using it in some of the μ-recursive functions, which are building blocks of general-purpose computation.
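As a toy illustration of positional number encoding on spiking hardware (in spirit only; this is not the paper's virtual neuron construction), an integer can be spread across binary neurons with power-of-two weights:

```python
# Toy positional encoding: one bit per binary (spike / no-spike) neuron,
# with weight 2**i on neuron i. Illustrative only, not the virtual neuron.
def encode(value, n_neurons=8):
    """Return spikes (0/1 per neuron) representing a non-negative integer."""
    return [(value >> i) & 1 for i in range(n_neurons)]

def decode(spikes):
    """Weighted sum of spikes with weights 2**i recovers the integer."""
    return sum(s << i for i, s in enumerate(spikes))

a, b = encode(13), encode(29)
assert decode(a) + decode(b) == 42   # addition on the decoded values
```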
In search and rescue missions, unmanned aerial vehicles are widely used to distribute first-aid kits and food packages. It is important that these UAVs can identify and distinguish markers for effective distribution. One common way to mark locations is with characters superimposed on shapes of various colors, which yields a variety of markers based on combinations of different shapes, characters, and their respective colors. In this paper, we propose an object detection and classification pipeline that prevents false positives and minimizes the misclassification of alphanumeric characters and shapes in aerial images. Our method uses traditional computer vision techniques and unsupervised machine learning approaches to identify region proposals, segment image targets, and remove false positives. We use computationally lightweight models for classification, making them easy to deploy on any aerial vehicle.
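Below is a minimal sketch of a classical region-proposal step of the kind such a pipeline describes, using color segmentation and contour extraction in OpenCV; the HSV band and area threshold are illustrative values, not the paper's tuned parameters.

```python
import cv2
import numpy as np

def propose_regions(bgr_image, lo=(100, 80, 80), hi=(130, 255, 255),
                    min_area=200):
    """Segment one color band and return bounding boxes of candidate markers."""
    hsv = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2HSV)
    mask = cv2.inRange(hsv, np.array(lo), np.array(hi))    # color segmentation
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    return [cv2.boundingRect(c) for c in contours
            if cv2.contourArea(c) >= min_area]             # drop tiny blobs
```

The (x, y, w, h) crops returned here would then be passed to a lightweight classifier for the character and shape labels.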
Testing of integrated circuits (ICs) is a very expensive process, yet also one of the most important in determining the defect level of ICs. Manufacturing defects in ICs are modeled using the stuck-at fault model, which covers the majority of physical faults that occur during manufacturing. With advances in semiconductor technology, feature sizes have shrunk and defects have become smaller as well. Tests for these hard-to-detect defects are generated using deterministic test generation (DTG) algorithms. Our work aims to reduce the cost of PODEM (the Path-Oriented DEcision Making DTG algorithm) without compromising test quality. We train a meta-predictor to select the best model for a given circuit and target net. This ensemble selects the best prediction model with 95% accuracy, which reduces the number of backtracking decisions and gives PODEM better CPU time. We show that, in terms of CPU time, our ML-guided PODEM algorithm with the meta-predictor outperforms the baseline by more than 34%, and other state-of-the-art ML-guided algorithms by at least 15%, on ISCAS85 benchmark circuits.
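A hedged sketch of what a meta-predictor of this kind could look like: a classifier that, given features of a circuit and target net, picks which base prediction model to trust. The features, labels, and model names below are synthetic illustrations, not the paper's setup.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-ins: per-(circuit, net) features (e.g., gate count, depth,
# fanout) and the index of the base model that performed best on each case.
rng = np.random.default_rng(0)
X = rng.random((200, 4))
y = rng.integers(0, 3, size=200)

meta = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
candidate_models = ["model_a", "model_b", "model_c"]

def select_model(features):
    """Return the base model predicted to guide PODEM best for this net."""
    return candidate_models[int(meta.predict([features])[0])]

print(select_model(rng.random(4)))
```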