Medical applications challenge today's text categorization techniques by demanding both high accuracy and ease of interpretation. Although deep learning has provided a leap ahead in accuracy, this leap comes at the sacrifice of interpretability. To address this accuracy-interpretability challenge, we here introduce, for the first time, a text categorization approach that leverages the recently introduced Tsetlin machine. In brief, we represent the terms of a text as propositional variables. From these, we capture categories using simple propositional formulae, such as: if "rash" and "reaction" and "penicillin" then allergy. The Tsetlin machine learns these formulae from labelled texts, utilizing conjunctive clauses to represent the particular facets of each category. Indeed, even the absence of terms (negated features) can be used for categorization purposes. Our empirical comparison with Naive Bayes, decision trees, linear support vector machines (SVMs), random forests, long short-term memory (LSTM) neural networks, and other techniques is quite conclusive. The Tsetlin machine either performs on par with or outperforms all of the evaluated methods on both the 20 Newsgroups and IMDb datasets, as well as on a non-public clinical dataset. On average, the Tsetlin machine delivers the best recall and precision scores across the datasets. Finally, our GPU implementation of the Tsetlin machine executes 5 to 15 times faster than the CPU implementation, depending on the dataset. We thus believe that our novel approach can have a significant impact on a wide range of text analysis applications, forming a promising starting point for deeper natural language understanding with the Tsetlin machine.
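To make the clause representation concrete, here is a minimal Python sketch (not the authors' implementation) of how a learned conjunctive clause could be evaluated over propositional term features; `featurize` and `clause_fires` are illustrative helpers:

```python
# Minimal sketch: evaluating a conjunctive clause over propositional term
# features. Terms present in a document become True variables; a clause
# fires only if all of its positive literals are present and all of its
# negated literals are absent.

def featurize(text, vocabulary):
    """Map a document to propositional variables: term -> present/absent."""
    tokens = set(text.lower().split())
    return {term: term in tokens for term in vocabulary}

def clause_fires(features, positive_literals, negated_literals):
    """A conjunctive clause such as: rash AND reaction AND penicillin."""
    return (all(features[t] for t in positive_literals) and
            not any(features[t] for t in negated_literals))

vocabulary = ["rash", "reaction", "penicillin", "headache"]
doc = featurize("patient developed a rash reaction to penicillin", vocabulary)

# The example clause from the abstract: if "rash" AND "reaction" AND
# "penicillin" then allergy. Negated literals are left empty here.
print(clause_fires(doc, ["rash", "reaction", "penicillin"], []))  # True
```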
We consider prediction with expert advice under the log loss, with the goal of deriving efficient and robust algorithms. We argue that existing algorithms such as exponentiated gradient, online gradient descent, and online Newton step do not adequately satisfy both of these requirements. Our main contribution is an analysis of the Prod algorithm, which is robust to any data sequence and runs in time linear in the number of experts in each round. Despite the unbounded nature of the log loss, we derive bounds that are independent of the maximal loss and maximal gradient, and depend only on the number of experts and the time horizon. Furthermore, we give a Bayesian interpretation of the algorithm and adapt it to derive a tracking regret.
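As a rough illustration of the kind of update Prod performs, the sketch below applies the generic Prod rule to mixture forecasting under the log loss. This is a toy rendition, not necessarily the exact variant analyzed above, and it does not guard against the unboundedness of the log loss that the paper's analysis addresses:

```python
import numpy as np

# Generic Prod sketch for prediction with expert advice under the log loss.
# Each round, the learner forecasts with the weighted mixture of the
# experts' probability forecasts, then multiplicatively updates each weight
# by (1 + eta * r_i), where r_i is the expert's instantaneous regret
# against the mixture. Runs in time linear in the number of experts.

def prod_round(weights, expert_probs, outcome, eta=0.1):
    """One round of Prod.
    expert_probs: probability each expert assigns to outcome = 1.
    outcome: observed label in {0, 1}.
    """
    p = expert_probs if outcome == 1 else 1.0 - expert_probs
    mixture = np.dot(weights, p) / weights.sum()   # learner's forecast
    loss_mix = -np.log(mixture)                    # learner's log loss
    loss_exp = -np.log(p)                          # each expert's log loss
    regrets = loss_mix - loss_exp                  # instantaneous regrets
    # Caveat: a very large expert loss could drive a weight negative here;
    # the paper's analysis handles the unbounded loss more carefully.
    return weights * (1.0 + eta * regrets), mixture

weights = np.ones(3)
for probs, y in [(np.array([0.9, 0.5, 0.2]), 1),
                 (np.array([0.8, 0.4, 0.3]), 1)]:
    weights, forecast = prod_round(weights, probs, y)
```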
We introduce two novel tree search algorithms that use a policy to guide search. The first algorithm is a best-first enumeration that uses a cost function that allows us to prove an upper bound on the number of nodes to be expanded before reaching a goal state. We show that this best-first algorithm is particularly well suited for "needle-in-a-haystack" problems. The second algorithm is based on sampling and we prove an upper bound on the expected number of nodes it expands before reaching a set of goal states. We show that this algorithm is better suited for problems where many paths lead to a goal. We validate these tree search algorithms on 1,000 computer-generated levels of Sokoban, where the policy used to guide the search comes from a neural network trained using A3C. Our results show that the policy tree search algorithms we introduce are competitive with a state-of-the-art domain-independent planner that uses heuristic search.
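One plausible instantiation of the best-first variant is sketched below, assuming a cost of depth divided by the policy's path probability (an assumption, not necessarily the paper's exact cost function); `policy` and `is_goal` are hypothetical hooks supplied by the caller:

```python
import heapq

# Sketch of policy-guided best-first search: nodes are expanded in order of
# a cost that trades off depth against the policy's probability of the path
# leading to the node. policy(state) yields (action_probability, next_state)
# pairs, e.g. from a neural network; states must be hashable.

def policy_best_first(start, is_goal, policy):
    frontier = [(0.0, 0, start, 1.0, 0)]  # (cost, tiebreak, state, prob, depth)
    tiebreak = 0
    seen = {start}
    while frontier:
        cost, _, state, prob, depth = heapq.heappop(frontier)
        if is_goal(state):
            return state, depth
        for action_prob, nxt in policy(state):
            if action_prob <= 0 or nxt in seen:
                continue
            seen.add(nxt)
            new_prob = prob * action_prob
            new_cost = (depth + 1) / new_prob  # cheaper when the policy likes the path
            tiebreak += 1
            heapq.heappush(frontier, (new_cost, tiebreak, nxt, new_prob, depth + 1))
    return None, -1
```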
We propose a multi-armed bandit algorithm based on randomized exploration of the arms' histories. The key idea is to estimate the value of an arm from a bootstrap sample of its history, into which we inject pseudo-observations after each pull of the arm. The pseudo-observations seem harmful, but quite the opposite: they guarantee, with high probability, that the bootstrap sample is optimistic. We therefore call our algorithm Giro, an abbreviation for "garbage in, reward out". We analyze Giro in a $K$-armed Bernoulli bandit and prove an $O(K \Delta^{-1} \log n)$ bound on its $n$-round regret, where $\Delta$ denotes the gap between the expected rewards of the optimal arm and the best suboptimal arm. The main advantage of our exploration strategy is that it can be applied to any reward-function generalization, such as a neural network. We evaluate Giro and its contextual variant on multiple synthetic and real-world problems, and observe that Giro performs comparably to, or better than, state-of-the-art algorithms.
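The bootstrap-with-pseudo-observations estimate is easy to sketch for the Bernoulli case. The snippet below is illustrative, adding one pseudo reward of 0 and one of 1 per real observation (the exact amount is a tunable parameter in the paper):

```python
import random

# Sketch of Giro's arm index in a Bernoulli bandit: the arm's history is
# augmented with pseudo-observations, and the index is the mean of a
# bootstrap resample of that augmented history. The pseudo-observations
# make the resample optimistic with high probability.

def giro_index(history, pseudo_per_obs=1):
    """history: non-empty list of observed {0, 1} rewards for one arm."""
    augmented = list(history)
    for _ in range(pseudo_per_obs * len(history)):
        augmented.extend([0, 1])                          # pseudo-observations
    sample = random.choices(augmented, k=len(augmented))  # bootstrap resample
    return sum(sample) / len(sample)

# Each round, pull the arm with the highest bootstrap index.
histories = [[1, 0, 1], [0, 0], [1]]
best_arm = max(range(len(histories)), key=lambda a: giro_index(histories[a]))
```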
We introduce a new model for online ranking in which the click probability factors into an examination function and an attractiveness function, where the attractiveness function is linear in a feature vector and an unknown parameter. Only relatively mild assumptions are made on the examination function. We analyze a novel algorithm for this setting, showing that the dependence on the number of items is replaced by a dependence on the dimension, which allows the new algorithm to handle a large number of items. When reduced to the orthogonal case, the algorithm's regret improves on the state of the art.
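The factored click model can be written down directly. In the sketch below, `chi` (the examination function) and `theta_star` (the unknown parameter) are made-up stand-ins, and the inner product is assumed to lie in [0, 1]:

```python
import numpy as np

# Sketch of the factored click model: the probability of a click on the item
# in position k is an examination term chi(k) times an attractiveness term
# that is linear in the item's feature vector, <x, theta*>.

def click_probability(position, item_features, theta_star, chi):
    return chi(position) * float(item_features @ theta_star)

chi = lambda k: 1.0 / (k + 1)           # examination decays with position
theta_star = np.array([0.4, 0.2, 0.1])  # unknown parameter (illustrative)
x = np.array([0.5, 1.0, 0.2])           # item feature vector
p = click_probability(0, x, theta_star, chi)  # click probability at the top slot
```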
Flow correlation is the core technique used in a multitude of deanonymization attacks on Tor. Despite the importance of flow correlation attacks on Tor, existing flow correlation techniques are considered ineffective and unreliable in linking Tor flows when applied at large scale: they either produce high false-positive error rates or require impractically long flow observations to make reliable correlations. In this paper we show that, unfortunately, flow correlation attacks can be conducted on Tor traffic with drastically higher accuracy than before by leveraging emerging learning mechanisms. In particular, we design a system called DeepCorr that outperforms the state of the art by significant margins in correlating Tor connections. DeepCorr leverages an advanced deep learning architecture to learn a flow correlation function tailored to Tor's complex network, in contrast to previous works' use of generic statistical correlation metrics to correlate Tor flows. We show that, with moderate learning, DeepCorr can correlate Tor connections (and therefore break their anonymity) with accuracy significantly higher than existing algorithms, and using substantially shorter flow observations. For instance, by collecting only about 900 packets of each target Tor flow (roughly 900 KB of Tor data), DeepCorr achieves a flow correlation accuracy of 96%, compared to 4% for the state-of-the-art system RAPTOR using the same exact setting. Given the recent advances in learning algorithms, we hope our work demonstrates the escalating threat of flow correlation attacks on Tor and calls for the timely deployment of effective countermeasures by the Tor community.
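The following PyTorch snippet sketches the general idea of a learned flow correlation function; the architecture, input layout, and dimensions are illustrative stand-ins rather than DeepCorr's actual design:

```python
import torch
import torch.nn as nn

# Rough sketch of learning a flow correlation function with a CNN: each
# example stacks timing and size features of a candidate flow pair, and the
# network outputs the probability that the two flows are the two ends of
# the same Tor connection.

class FlowCorrelator(nn.Module):
    def __init__(self, flow_len=300):
        super().__init__()
        # Input: (batch, 1, 4, flow_len) -- four rows, e.g. the inter-packet
        # delays and packet sizes of the two candidate flows.
        self.net = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=(4, 5)), nn.ReLU(),
            nn.Flatten(),
            nn.Linear(32 * (flow_len - 4), 64), nn.ReLU(),
            nn.Linear(64, 1),
        )

    def forward(self, pair):
        return torch.sigmoid(self.net(pair))  # P(flows are correlated)

model = FlowCorrelator()
fake_pair = torch.randn(8, 1, 4, 300)  # a batch of 8 candidate flow pairs
scores = model(fake_pair)              # shape (8, 1)
```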
We study an online learning-to-re-rank problem, where users provide feedback that is used to improve the quality of displayed lists. Learning to rank has traditionally been studied in two settings. In the offline setting, rankers are typically learned from relevance labels created by judges; this approach has become the industry standard, but it lacks exploration and is therefore limited by the information content of the offline data. In the online setting, an algorithm can propose lists and learn from feedback on them in a sequential fashion. Bandit algorithms developed for this setting actively experiment, and in this way overcome the biases of offline data. But they also tend to ignore offline data, which results in a high initial cost of exploration. We propose BubbleRank, a bandit algorithm for re-ranking that combines the strengths of both settings. The algorithm starts with an initial base list and improves it gradually by exchanging higher-ranked, less attractive items for lower-ranked, more attractive items. We prove an upper bound on the n-step regret of BubbleRank that degrades gracefully with the quality of the initial base list. Our theoretical findings are supported by extensive numerical experiments on a large real-world click dataset.
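A much-simplified sketch of the exchange step is shown below; the actual algorithm randomizes which neighbours are compared and accumulates click statistics before committing to a swap, both of which this toy version omits:

```python
# Toy version of a BubbleRank-style exchange pass: starting from a base
# list, a lower-ranked item that collected more clicks than its upstairs
# neighbour in the last display bubbles up one position.

def exchange_step(ranking, clicks):
    """clicks[i]: clicks observed for the item at position i of `ranking`."""
    paired = list(zip(ranking, clicks))
    for i in range(len(paired) - 1):
        if paired[i + 1][1] > paired[i][1]:  # lower item looks more attractive
            paired[i], paired[i + 1] = paired[i + 1], paired[i]
    return [item for item, _ in paired]

base_list = ["doc_a", "doc_b", "doc_c", "doc_d"]
print(exchange_step(base_list, clicks=[1, 3, 0, 2]))  # ['doc_b', 'doc_a', 'doc_d', 'doc_c']
```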
This qualitative study examines privacy practices and concerns among contributors to open collaboration projects. We collected interview data from people who use the anonymity network Tor who also contribute to online projects and from Wikipedia editors who are concerned about their privacy to better understand how privacy concerns impact participation in open collaboration projects. We found that risks perceived by contributors to open collaboration projects include threats of surveillance, violence, harassment, opportunity loss, reputation loss, and fear for loved ones. We explain participants' operational and technical strategies for mitigating these risks and how these strategies affect their contributions. Finally, we discuss chilling effects associated with privacy loss, the need for open collaboration projects to go beyond attracting and educating participants to consider their privacy, and some of the social and technical approaches that could be explored to mitigate risk at a project or community level.
Tor is free software that enables anonymous communication. It defends users against traffic analysis and network surveillance. It is also useful for confidential business activities and state security. At the same time, anonymized protocols have been used to access criminal websites such as those dealing with illegal drugs. This paper proposes a new method for launching a fingerprinting attack to analyze Tor traffic in order to detect users who access illegal websites. Our new method is based on Stacked Denoising Autoencoder, a deep-learning technology. Our evaluation results show 0.88 accuracy in a closed-world test. In an open-world test, the true positive rate is 0.86 and the false positive rate is 0.02.
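For concreteness, the snippet below sketches one denoising-autoencoder layer of the kind such a classifier stacks; the dimensions, noise level, and input features are illustrative stand-ins, not the paper's configuration:

```python
import torch
import torch.nn as nn

# One denoising-autoencoder layer: pretrained to reconstruct its clean
# input from a corrupted version. The stacked encoders of several such
# layers would then feed a softmax classifier over website labels.

class DenoisingAE(nn.Module):
    def __init__(self, in_dim, hidden_dim, noise_std=0.1):
        super().__init__()
        self.noise_std = noise_std
        self.encoder = nn.Sequential(nn.Linear(in_dim, hidden_dim), nn.Sigmoid())
        self.decoder = nn.Sequential(nn.Linear(hidden_dim, in_dim), nn.Sigmoid())

    def forward(self, x):
        corrupted = x + self.noise_std * torch.randn_like(x)  # inject noise
        return self.decoder(self.encoder(corrupted))          # reconstruct x

# Pretraining objective: reconstruct the clean traffic feature vector.
layer = DenoisingAE(in_dim=500, hidden_dim=128)
features = torch.rand(16, 500)  # e.g. per-flow traffic features (illustrative)
loss = nn.functional.mse_loss(layer(features), features)
```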
Website fingerprinting allows a local, passive observer monitoring a web-browsing client's encrypted channel to determine her web activity. Previous attacks have shown that website fingerprinting could be a threat to anonymity networks such as Tor under laboratory conditions. However, there are significant differences between laboratory conditions and realistic conditions. First, in laboratory tests we collect the training data set together with the testing data set, so the training data set is fresh, but an attacker may not be able to maintain a fresh data set. Second, laboratory packet sequences correspond to a single page each, but for realistic packet sequences the split between pages is not obvious. Third, packet sequences may include background noise from other types of web traffic. These differences adversely affect website fingerprinting under realistic conditions. In this paper, we tackle these three problems to bridge the gap between laboratory and realistic conditions for website fingerprinting. We show that we can maintain a fresh training set with minimal resources. We demonstrate several classification-based techniques that allow us to split full packet sequences effectively into sequences corresponding to a single page each. We describe several new algorithms for tackling background noise. With our techniques, we are able to build the first website fingerprinting system that can operate directly on packet sequences collected in the wild.
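As a toy illustration of the splitting subproblem described above, the heuristic below segments a packet sequence at long time gaps; the paper's splitting is classification-based and learned, so this is only a baseline stand-in for the idea:

```python
# Splitting a full packet sequence into per-page segments by time gaps:
# a long silence between packets is taken to suggest a page boundary.

def split_by_gap(packets, gap_threshold=2.0):
    """packets: non-empty list of (timestamp, direction) tuples, sorted by time."""
    pages, current = [], [packets[0]]
    for prev, cur in zip(packets, packets[1:]):
        if cur[0] - prev[0] > gap_threshold:  # long silence: start a new page
            pages.append(current)
            current = []
        current.append(cur)
    pages.append(current)
    return pages

trace = [(0.0, +1), (0.3, -1), (0.5, -1), (5.2, +1), (5.4, -1)]
print(len(split_by_gap(trace)))  # 2 page segments
```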