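The sentiment polarity of tweets, blog posts, and product reviews has become highly attractive and is used in recommender systems, market prediction, business intelligence, and more. Deep learning techniques are becoming the top performers at analyzing such texts. However, several problems must be solved before deep neural networks can be used efficiently for text mining and text polarity analysis. First, deep neural networks need to be fed with datasets that are both large and properly labeled. Second, there are various uncertainties about the use of word embedding vectors: should they be generated from the same dataset used to train the model, or is it better to source them from large and popular collections? Third, to simplify model creation, it is convenient to have generic neural network architectures that are effective, adapt to various texts, and encapsulate much of the design complexity. This thesis addresses the above problems and offers methodological and practical insights for using neural networks in sentiment analysis and achieving state-of-the-art results. Regarding the first problem, the effectiveness of various crowdsourcing alternatives is explored, and two medium-sized, emotion-labeled song datasets are created using social tags. To address the second problem, a series of experiments was conducted on large text collections of various contents and domains, trying word embeddings with various parameters. Regarding the third problem, a series of experiments involving convolutional and max-pooling neural layers was conducted. Combining convolutions over words, bigrams, and trigrams with regional max-pooling layers in a couple of stacks produced the best results. The derived architecture achieves competitive performance on sentiment polarity analysis of movie, business, and product reviews.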
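The combination of word, bigram, and trigram convolutions with regional max pooling mentioned in the abstract above can be illustrated with a minimal sketch. This is not the author's architecture: it uses a single filter per n-gram width, random weights, and one stack, and every name in it (`conv1d`, `regional_max_pool`, `ngram_features`, the embedding size, the pooling region) is an assumption chosen only for illustration.

```python
import random


def conv1d(embeddings, kernel, width):
    """Slide one filter over `width` consecutive word vectors (no padding), then apply ReLU."""
    out = []
    for i in range(len(embeddings) - width + 1):
        # Concatenate the vectors of the n-gram starting at position i.
        window = [x for vec in embeddings[i:i + width] for x in vec]
        activation = sum(w * x for w, x in zip(kernel, window))
        out.append(max(0.0, activation))
    return out


def regional_max_pool(features, region):
    """Take the max over non-overlapping regions; the last region may be shorter."""
    return [max(features[i:i + region]) for i in range(0, len(features), region)]


def ngram_features(embeddings, kernels, region=2):
    """Concatenate the pooled feature maps of every n-gram width in `kernels`."""
    pooled = []
    for width, kernel in kernels.items():
        pooled.extend(regional_max_pool(conv1d(embeddings, kernel, width), region))
    return pooled


# Hypothetical toy input: a 6-word sentence with 4-dimensional random embeddings.
rng = random.Random(0)
dim = 4
sentence = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(6)]
# One random filter each for unigrams, bigrams, and trigrams.
kernels = {n: [rng.uniform(-1, 1) for _ in range(n * dim)] for n in (1, 2, 3)}

features = ngram_features(sentence, kernels)
print(len(features))  # 3 + 3 + 2 = 8 pooled values
```

In a real model the pooled vector would feed further stacks or a classifier, and the filters would be learned rather than random. Regional pooling keeps several maxima per feature map, so some positional information survives that a single global maximum would discard.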
The World Wide Web has grown to be a primary source of information for millions of people. Due to the size of the Web, search engines have become the major access point for this information. However, "commercial" search engines use hidden algorithms that put the integrity of their results in doubt, collect user data that raises privacy concerns, and target the general public and thus fail to serve the needs of specific search users. Open source search, like open source operating systems, offers alternatives. The goal of the Open Source Information Retrieval Workshop (OSIR) is to bring together practitioners developing open source search technologies in the context of a premier IR research conference to share their recent advances, and to coordinate their strategy and research plans. The intent is to foster community-based development, to promote distribution of transparent Web search tools, and to strengthen the interaction with the research community in IR. A workshop about Open Source Web Information Retrieval was held last year in Compiègne, France as part of WI 2005. The focus of this workshop is broadened to the whole open source information retrieval community. We want to thank all the authors of the submitted papers, the members of the program committee, and the several reviewers whose contributions have resulted in these high quality proceedings.
ABSTRACT There has been a resurgence of interest in index maintenance (or incremental indexing) in the academic community in the last three years. Most of this work focuses on how to build indexes as quickly as possible, given the need to run queries during the build process. This work is based on a different set of assumptions than previous work. First, we focus on latency instead of throughput. We focus on reducing index latency (the amount of time between when a new document is available to be indexed and when it is available to be queried) and query latency (the amount of time that an incoming query must wait because of index processing). Additionally, we assume that users are unwilling to tune parameters to make the system more efficient. We show how this set of assumptions has driven the development of the Indri index maintenance strategy, and describe the details of our implementation.
In the era of the Internet of Things (IoT), an enormous number of sensing devices collect and/or generate various sensory data over time for a wide range of fields and applications. Depending on the nature of the application, these devices produce big or fast/real-time data streams. Applying analytics over such data streams to discover new information, predict future insights, and make control decisions is a crucial process that makes IoT a worthy paradigm for businesses and a quality-of-life improving technology. In this paper, we provide a thorough overview of using a class of advanced machine learning techniques, namely Deep Learning (DL), to facilitate analytics and learning in the IoT domain. We start by articulating IoT data characteristics and identifying two major treatments for IoT data from a machine learning perspective, namely IoT big data analytics and IoT streaming data analytics. We also discuss why DL is a promising approach to achieve the desired analytics in these types of data and applications. The potential of using emerging DL techniques for IoT data analytics is then discussed, and its promises and challenges are introduced. We present a comprehensive background on different DL architectures and algorithms. We also analyze and summarize major reported research attempts that leveraged DL in the IoT domain. The smart IoT devices that have incorporated DL in their intelligence background are also discussed. DL implementation approaches on the fog and cloud centers in support of IoT applications are also surveyed. Finally, we shed light on some challenges and potential directions for future research. At the end of each section, we highlight the lessons learned based on our experiments and review of the recent literature.
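This report describes 18 projects that explored how commercial cloud computing services can be used for scientific computing at national laboratories. The demonstrations ranged from deploying proprietary software in a cloud environment to using established cloud-based analytics workflows to process scientific datasets. By and large, the projects were successful, and collectively they suggest that cloud computing can be a valuable computational resource for scientific computing at national laboratories.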
Visual analytics systems combine machine learning or other analytic techniques with interactive data visualization to promote sensemaking and analytical reasoning. It is through such techniques that people can make sense of large, complex data. While progress has been made, the tactful combination of machine learning and data visualization is still under-explored. This state-of-the-art report presents a summary of the progress that has been made by highlighting and synthesizing select research advances. Further, it presents opportunities and challenges to enhance the synergy between machine learning and visual analytics for impactful future research directions.
Big data: A survey
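A flood of research and practical applications uses social media data for a wide range of public applications, including environmental monitoring, water resource management, and disaster and emergency response. Hydroinformatics can benefit from social media technologies, drawing on newly emerging data, techniques, and analytical tools to handle large datasets. This paper first proposes a 4W (What, Why, When, hoW) model and a methodological structure to better understand and represent the application of social media to hydroinformatics, then provides an overview of academic research applying social media to hydroinformatics in areas such as water environment, water resources, flood, drought, and water scarcity management. Finally, building on the preceding discussion, it presents hydroinformatics managers and researchers with some advanced topics and suggestions for water-related social media applications, covering data collection, data quality management, fake news detection, privacy issues, algorithms, and platforms.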
During the past decade, several areas of speech and language understanding have witnessed substantial breakthroughs from the use of data-driven models. In the area of dialogue systems, the trend is less obvious, and most practical systems are still built through significant engineering and expert knowledge. Nevertheless, several recent results suggest that data-driven approaches are feasible and quite promising. To facilitate research in this area, we have carried out a wide survey of publicly available datasets suitable for data-driven learning of dialogue systems. We discuss important characteristics of these datasets, how they can be used to learn diverse dialogue strategies, and their other potential uses. We also examine methods for transfer learning between datasets and the use of external knowledge. Finally, we discuss appropriate choice of evaluation metrics for the learning objective.
As we are moving towards the Internet of Things (IoT), the number of sensors deployed around the world is growing at a rapid pace. Market research has shown significant growth in sensor deployments over the past decade and has predicted a significant increase in the growth rate in the future. These sensors continuously generate enormous amounts of data. However, in order to add value to raw sensor data we need to understand it. Collection, modelling, reasoning, and distribution of context in relation to sensor data play a critical role in this challenge. Context-aware computing has proven to be successful in understanding sensor data. In this paper, we survey context awareness from an IoT perspective. We present the necessary background by introducing the IoT paradigm and context-aware fundamentals at the beginning. Then we provide an in-depth analysis of the context life cycle. We evaluate a subset of projects (50) which represent the majority of research and commercial solutions proposed in the field of context-aware computing conducted over the last decade (2001-2011) based on our own taxonomy. Finally, based on our evaluation, we highlight the lessons to be learnt from the past and some possible directions for future research. The survey addresses a broad range of techniques, methods, models, functionalities, systems, applications, and middleware solutions related to context awareness and IoT. Our goal is not only to analyse, compare and consolidate past research work but also to appreciate their findings and discuss their applicability towards the IoT.
In this paper, several strategies for cross-language image indexing and terminological glossary compilation are presented. The process starts from a source-language indexed image. CBIR is proposed as a means to find similar images in target-language documents on the web. The text surrounding the matched target image is chunked, and the chunks are classified into concrete and abstract nouns by means of a discriminant analysis. The number of images retrieved by each chunk and the edit distance between each chunk and each image file name are taken as differentiating variables; a 74.4% rate of correctly classified labeled examples shows the adequacy of these variables. Nouns classified as concrete are used to retrieve images from the web, and each retrieved image is compared with the image in the target document. When a positive match occurs, the chunk used to retrieve the matched image is assigned as the index for the image in the target document and as the target-language equivalent of the source image index. As the experiments are carried out in specialized domains, the approach is applied systematically and recursively to build terminological glossaries by storing images with their respective cross-language indices.
Abstract In this paper we present some ongoing work and ideas on how to relate text-based semantics to images in web documents. We suggest applying different levels of Natural Language Processing (NLP) to textual documents and speech transcripts associated with images in order to provide structured linguistic information that can be merged with available domain knowledge to generate additional semantic metadata for the images. An issue to be specifically addressed in the near future concerns automating the detection of relevant text/speech transcripts for a certain image (or video sequence). Beyond the time code approach, with its shortcomings, we expect the discussion in this workshop on the lexical characteristics of language that can or should be used to describe image content to improve the approaches we are currently working with.
Abstract In this paper, we describe an image collection created for the CLEF cross-language image retrieval track (ImageCLEF). This image retrieval benchmark (referred to as the IAPR TC-12 Benchmark) has developed from an initiative started by the Technical Committee 12 (TC-12) of the International Association of Pattern Recognition (IAPR). The collection consists of 20,000 images from a private photographic image collection. The construction and composition of the IAPR TC-12 Benchmark are described, including its associated text captions, which are expressed in multiple languages, making the collection well suited for evaluating the effectiveness of both text-based and visual retrieval methods. We also discuss the current and expected uses of the collection, including its use to benchmark and compare different image retrieval systems in ImageCLEF 2006.
Abstract In the fie
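The first abstract in the group above uses two differentiating variables per text chunk: the number of images the chunk retrieves and the edit distance between the chunk and each image file name. A minimal sketch of how such a feature pair might be computed follows. It assumes Levenshtein distance as the edit distance and reduces the per-file distances to their minimum; both choices, and the names `edit_distance` and `chunk_features`, are illustrative assumptions, not the authors' implementation.

```python
import os


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def chunk_features(chunk, image_filenames, images_retrieved):
    """Return (images_retrieved, smallest edit distance to any file-name stem).

    `images_retrieved` is assumed to come from a separate image search step
    that is not modeled here.
    """
    stems = [os.path.splitext(os.path.basename(name))[0].lower()
             for name in image_filenames]
    min_dist = min(edit_distance(chunk.lower(), stem) for stem in stems)
    return (images_retrieved, min_dist)


# Hypothetical example values.
print(edit_distance("kitten", "sitting"))                                # 3
print(chunk_features("valve", ["img/Valve.jpg", "img/pump_01.png"], 12))  # (12, 0)
```

A feature pair like this could then be passed to any discriminant classifier. A small distance to a file name together with many retrieved images would plausibly point to a concrete noun, though the abstract does not state the direction of the effect.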
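International standardization bodies face growing problems as the number and size of the standards they produce increase. Sometimes a lack of coordination among the committees responsible for developing standards leads to overlaps, errors, or incompatibilities in the documents. The aim of this study is to provide a method for automatically extracting the technical concepts (terms) in normative documents using semantic tools from the field of language processing. The first part of the paper introduces the standardization world, its structure, its way of working, and the problems it faces; we then introduce the concepts of semantic annotation and information extraction, together with the software tools available in this field. The next section explains the concept of ontology and its potential use in the field of standardization. We propose a method that extracts technical information from a given normative corpus through a semantic annotation process performed against a reference ontology. An application to the ISO 15531 MANDATE corpus provides a first use case for the method described here. The paper closes with a description of the first experimental results produced by this approach, along with some open issues and perspectives, in particular its application to other standards and/or technical committees and the possibility of creating predefined terminologies.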
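Autonomous mechanisms have been proposed to regulate certain aspects of society and are already being used to regulate business organizations. We examine recent proposals for the algorithmic regulation of society, and we identify existing technologies that could be used to implement them, technologies that were originally introduced in business contexts. We build on the notion of the "social machine" and connect it to various ongoing trends and ideas, including crowdsourced task work, social compilers, mechanism design, reputation management systems, and social scoring. After showing how all the building blocks of algorithmic regulation are already in place, we discuss possible implications for human autonomy and social order. The main contribution of this paper is to identify converging social and technical trends that are leading towards social regulation by algorithms, and to discuss the possible social, political, and ethical consequences of taking this path.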
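Emotions are often a crucial component of compelling narratives: literature is about people with goals, desires, passions, and intentions. In the past, classical literary studies usually scrutinized the affective dimension of literature within the framework of hermeneutics. However, with the emergence of the research field known as Digital Humanities (DH), some studies of emotion in a literary context have taken a computational turn. Given that DH is still taking shape as a science, this research direction can be considered relatively new. At the same time, research on sentiment analysis started in computational linguistics almost two decades ago and is now an established field with dedicated workshops and tracks at the main computational linguistics conferences. This raises the question: what are the commonalities and differences between sentiment analysis research in computational linguistics and in digital humanities? In this survey, we offer an overview of the existing body of research on sentiment and emotion analysis as applied to literature. We precede the main part of the survey with a short introduction to natural language processing and machine learning and to psychological models of emotion, and we provide an overview of existing approaches to sentiment and emotion analysis in computational linguistics. The papers presented in this survey come either directly from DH or from computational linguistics venues, and are limited to sentiment and emotion analysis applied to literary text.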
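Here we review aspects of forefront research and innovation that exploit big data and machine learning (ML), two areas of computer science that combine to yield machine intelligence. ML can accelerate the solution of intricate chemical problems and even solve problems that would otherwise be intractable. But the potential benefits of ML come at the cost of big data production; that is, in order to learn, the algorithms demand large volumes of data from diverse sources, ranging from materials properties to sensor data. In this survey, we propose a roadmap for future developments, with emphasis on materials discovery and chemical sensing, and within the context of the Internet of Things (IoT), both of which are prominent research fields for ML in the context of big data. In addition to providing an overview of recent advances, we elaborate on the conceptual and practical limitations of big data and ML applied to chemistry, outlining processes, discussing pitfalls, and reviewing cases of success and failure.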
The evaluation of artificial intelligence systems and components is crucial for the progress of the discipline. In this paper we describe and critically assess the different ways AI systems are evaluated, and the role of components and techniques in these systems. We first focus on the traditional task-oriented evaluation approach. We identify three kinds of evaluation: human discrimination, problem benchmarks and peer confrontation. We describe some of the limitations of the many evaluation schemes and competitions in these three categories, and follow the progression of some of these tests. We then focus on a less customary (and challenging) ability-oriented evaluation approach, where a system is characterised by its (cognitive) abilities, rather than by the tasks it is designed to solve. We discuss several possibilities: the adaptation of cognitive tests used for humans and animals, the development of tests derived from algorithmic information theory or more integrated approaches under the perspective of universal psychometrics. We analyse some evaluation tests from AI that are better positioned for an ability-oriented evaluation and discuss how their problems and limitations can possibly be addressed with some of the tools and ideas that appear within the paper. Finally, we enumerate a series of lessons learnt and generic guidelines to be used when an AI evaluation scheme is under consideration.
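Recently, many AI researchers and practitioners have embarked on research visions that involve doing AI for "Good". This is part of a general drive towards infusing AI research and practice with ethical thinking. One frequent theme in current ethical guidelines is the requirement that AI be good for all, or: contribute to the Common Good. But what is the Common Good, and is it enough to want to be good? Via four lead questions, I will illustrate the challenges and pitfalls of determining, from an AI point of view, what the Common Good is and how it can be enhanced by AI. The questions are: What is the problem / What is a problem?, Who defines the problem?, What is the role of knowledge?, and What are important side effects and dynamics? The illustration will use an example from the domain of "AI for Social Good", more specifically "Data Science for Social Good". Even if the importance of these questions may be known at an abstract level, they are not asked sufficiently in practice, as shown by an exploratory study of 99 contributions to recent conferences in the field. Turning these challenges and pitfalls into a positive recommendation, as a conclusion I will draw on another characteristic of computer-science thinking and practice to make these impediments visible and attenuate them: "attacks" as a method for improving design. This results in the proposal of ethics pen-testing as a method for helping AI designs to better contribute to the Common Good.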