The existence, public availability, and widespread acceptance of a standard benchmark for a given information retrieval (IR) task are beneficial to research on this task, because they allow different researchers to experimentally compare their own systems by comparing the results they have obtained on this benchmark. The Reuters-21578 test collection, together with its earlier variants, has been such a standard benchmark for the text categorization (TC) task throughout the last 10 years. However, the benefits that this has brought about have been somewhat limited by the fact that different researchers have "carved" different subsets out of this collection and tested their systems on one of these subsets only; systems that have been tested on different Reuters-21578 subsets are thus not readily comparable. In this article, we present a systematic, comparative experimental study of the three subsets of Reuters-21578 that have been most popular among TC researchers. The results we obtain allow us to determine the relative hardness of these subsets, thus establishing an indirect means for comparing TC systems that have been, or will be, tested on these different subsets.
The problem of classification has been widely studied in the data mining, machine learning, database, and information retrieval communities with applications in a number of diverse domains, such as target marketing, medical diagnosis, news group filtering, and document organization. In this paper we will provide a survey of a wide variety of text classification algorithms.
Preface

The performance of an Artificial Intelligence system often depends on the amount of world knowledge available to it. During the last decade, the AI community has witnessed the emergence of a number of highly structured knowledge repositories whose collaborative nature has led to a dramatic increase in the amount of world knowledge that can now be exploited in AI applications. Arguably, the best-known repository of user-contributed knowledge is Wikipedia. Since its inception less than eight years ago, it has become one of the largest and fastest growing on-line sources of encyclopedic knowledge. One of the reasons why Wikipedia is appealing to contributors and users alike is the richness of its embedded structural information: articles are hyperlinked to each other and connected to categories from an ever expanding taxonomy; pervasive language phenomena such as synonymy and polysemy are addressed through redirection and disambiguation pages; entities of the same type are described in a consistent format using infoboxes; related articles are grouped together in series templates. Many more repositories of user-contributed knowledge exist besides Wikipedia. Collaborative tagging in Delicious and community-driven question answering in Yahoo! Answers and Wiki Answers are only a few examples of knowledge sources that, like Wikipedia, can become a valuable asset for AI researchers. Furthermore, AI methods have the potential to improve these resources, as demonstrated recently by research on personalized tag recommendations or on matching user questions with previously answered questions. The goal of this workshop was to foster the research and dissemination of ideas on the mutually beneficial interaction between AI and repositories of user-contributed knowledge. This volume contains the papers accepted for presentation at the workshop. We issued calls for regular papers, short late-breaking papers, and demos. After careful review by the program committee of the 20 submissions received (13 regular papers, 6 short papers, and 1 demo), 5 regular papers and 3 short papers were accepted for presentation. Consistent with the original aim of the workshop, the accepted papers address a diverse set of problems and resources, although Wikipedia-based systems are still dominant. The accepted papers explore knowledge induced and patterns learned from Wikipedia, applying them to the web or to untagged text collections for tasks such as information extraction, entity disambiguation, terminology extraction, and analysing the structure of social networks. We also learn of useful methods that integrate Wikipedia with structured resources, in particular relational databases. The members of the program committee provided high-quality reviews in a timely fashion, and all submissions have benefited from this expert feedback. For a successful event, having high-quality invited speakers is crucial; we were lucky to have two excellent speakers for the workshop.
This paper describes the PASCAL Network of Excellence Recognising Textual Entailment (RTE) Challenge benchmark. The RTE task is defined as recognizing, given two text fragments, whether the meaning of one text can be inferred (entailed) from the other. This application-independent task is suggested as capturing major inferences about the variability of semantic expression which are commonly needed across multiple applications. The Challenge has raised noticeable attention in the research community, attracting 17 submissions from diverse groups and suggesting the generic relevance of the task.
This paper presents the design and evaluation of a text categorization method based on the Hierarchical Mixture of Experts model. This model uses a divide-and-conquer principle to define smaller categorization problems based on a predefined hierarchical structure. The final classifier is a hierarchical array of neural networks. The method is evaluated using the UMLS Metathesaurus as the underlying hierarchical structure, and the OHSUMED test set of MEDLINE records. Comparisons are provided with an optimized version of the traditional Rocchio algorithm adapted for text categorization, as well as with flat neural network classifiers. The results show that the use of the hierarchical structure improves text categorization performance with respect to an equivalent flat model. The optimized Rocchio algorithm achieves a performance comparable with that of the hierarchical neural networks.
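The Rocchio baseline referred to above is, at its core, a centroid-based classifier. As a rough, hedged illustration of that idea (not the optimized variant evaluated in the paper, and not the hierarchical neural networks), a minimal sketch in Python using TF-IDF vectors and cosine similarity might look like this:

```python
# Minimal sketch of a Rocchio-style centroid classifier for text categorization.
# Illustration only; not the optimized implementation evaluated in the paper.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import normalize

def train_rocchio(docs, labels):
    """Fit TF-IDF features and one L2-normalized centroid per category."""
    vectorizer = TfidfVectorizer()
    X = vectorizer.fit_transform(docs)
    centroids = {}
    for label in set(labels):
        rows = [i for i, y in enumerate(labels) if y == label]
        centroids[label] = normalize(np.asarray(X[rows].mean(axis=0)))
    return vectorizer, centroids

def classify(doc, vectorizer, centroids):
    """Assign the category whose centroid is most cosine-similar to the document."""
    x = normalize(vectorizer.transform([doc]).toarray())
    return max(centroids, key=lambda c: (x @ centroids[c].T).item())

# Tiny usage example with made-up documents and categories.
vec, cents = train_rocchio(
    ["heart attack symptoms", "stock market crash"], ["medical", "finance"])
print(classify("chest pain and heart rate", vec, cents))  # expected: "medical"
```

By analogy, the hierarchical approach described above would train one such specialized classifier (or a neural network) per node of the hierarchy and route each document down the tree, rather than deciding among all categories at once.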
Two recently implemented machine-learning algorithms, RIPPER and sleeping-experts for phrases, are evaluated on a number of large text categorization problems. These algorithms both construct classifiers that allow the "context" of a word w to affect how (or even whether) the presence or absence of w will contribute to a classification. However, RIPPER and sleeping-experts differ radically in many other respects: differences include different notions as to what constitutes a context, different ways of combining contexts to construct a classifier, different methods to search for a combination of contexts, and different criteria as to what contexts should be included in such a combination. In spite of these differences, both RIPPER and sleeping-experts perform extremely well across a wide variety of categorization problems, generally outperforming previously applied learning methods. We view this result as a confirmation of the usefulness of classifiers that represent contextual information.
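One simple way to make the notion of "context" concrete, purely as a hedged illustration and not as either algorithm's actual representation, is to expose short phrases (word n-grams) as features so that the words surrounding w can influence the decision:

```python
# Hedged illustration: exposing phrases (word n-grams) as features so that the
# context of a word can affect classification. This mimics the spirit of
# phrase-based contextual features, not RIPPER's rule induction or the
# sleeping-experts weighting scheme.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Unigrams alone treat "interest" identically everywhere; adding bigrams lets
# "interest rate" and "public interest" act as distinct contextual features.
model = make_pipeline(
    CountVectorizer(ngram_range=(1, 2), binary=True),
    LogisticRegression(max_iter=1000),
)

docs = ["central bank raises interest rate", "public interest in the election"]
labels = ["money-fx", "politics"]
model.fit(docs, labels)
print(model.predict(["interest rate decision expected"]))  # likely "money-fx"
```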
The World Wide Web has grown to be a primary source of information for millions of people. Due to the size of the Web, search engines have become the major access point for this information. However, "commercial" search engines use hidden algorithms that put the integrity of their results in doubt, collect user data that raises privacy concerns, and target the general public, thus failing to serve the needs of specific search users. Open source search, like open source operating systems, offers alternatives. The goal of the Open Source Information Retrieval Workshop (OSIR) is to bring together practitioners developing open source search technologies in the context of a premier IR research conference to share their recent advances and to coordinate their strategy and research plans. The intent is to foster community-based development, to promote distribution of transparent Web search tools, and to strengthen the interaction with the research community in IR. A workshop on Open Source Web Information Retrieval was held last year in Compiègne, France as part of WI 2005. The focus of this workshop has been broadened to the whole open source information retrieval community. We want to thank all the authors of the submitted papers, the members of the program committee, and the several reviewers whose contributions have resulted in these high-quality proceedings.

ABSTRACT

There has been a resurgence of interest in index maintenance (or incremental indexing) in the academic community in the last three years. Most of this work focuses on how to build indexes as quickly as possible, given the need to run queries during the build process. This work is based on a different set of assumptions than previous work. First, we focus on latency instead of throughput: we aim to reduce both index latency (the amount of time between when a new document is available to be indexed and when it is available to be queried) and query latency (the amount of time that an incoming query must wait because of index processing). Additionally, we assume that users are unwilling to tune parameters to make the system more efficient. We show how this set of assumptions has driven the development of the Indri index maintenance strategy, and describe the details of our implementation.
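As a toy illustration of the two latencies defined above (and emphatically not Indri's implementation), the following sketch times how long a document takes to become queryable and how long a query waits on index processing:

```python
# Toy in-memory incremental index used only to make the two metrics concrete:
# index latency = time from a document being available to it being queryable
# query latency = time an incoming query spends waiting on index processing
import time

class ToyIncrementalIndex:
    def __init__(self):
        self.postings = {}  # term -> set of document ids

    def add_document(self, doc_id, text):
        arrival = time.monotonic()
        for term in text.lower().split():
            self.postings.setdefault(term, set()).add(doc_id)
        return time.monotonic() - arrival          # index latency (seconds)

    def query(self, term):
        start = time.monotonic()
        hits = self.postings.get(term.lower(), set())
        return hits, time.monotonic() - start      # results, query latency

index = ToyIncrementalIndex()
print("index latency:", index.add_document(1, "open source search engines"))
print("query result, latency:", index.query("search"))
```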
In this paper, we report the results of our research on dense distributed representations of text data. We propose two novel neural models for learning such representations. The first model learns document-level representations, while the second learns word-level representations. For document-level representations we propose Binary Paragraph Vectors: neural network models for learning binary representations of text documents, which can be used for fast document retrieval. We provide a thorough evaluation of these models and demonstrate that they outperform the seminal method in this field on information retrieval tasks. We also report strong results in a transfer learning setting, in which our models are trained on a generic text corpus and then used to infer codes for documents from a domain-specific dataset. In contrast to previously proposed methods, the Binary Paragraph Vector models learn embeddings directly from raw text data. For word-level representations we propose Disambiguated Skip-gram: a neural network model for learning multi-sense word embeddings. Representations learned with this model can be used in downstream tasks, such as part-of-speech tagging or the identification of semantic relations. In the word sense induction task, Disambiguated Skip-gram outperforms state-of-the-art models on three benchmark datasets. Our model has an elegant probabilistic interpretation. Moreover, unlike previous models of this kind, it is differentiable with respect to all its parameters and can be trained with backpropagation. In addition to quantitative results, we also present a qualitative evaluation of Disambiguated Skip-gram, including two-dimensional visualizations of selected word-sense embeddings.
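To illustrate why binary document codes support fast retrieval, the following hedged sketch assumes that some trained model (such as the Binary Paragraph Vectors described above) has already produced real-valued document embeddings; it merely binarizes them and ranks candidates by Hamming distance, which reduces retrieval to cheap bitwise comparisons:

```python
# Hedged sketch of retrieval with binary document codes. The embedding model
# itself is not reproduced here; random vectors stand in for trained embeddings.
import numpy as np

def binarize(real_valued_codes):
    """Threshold real-valued document embeddings into {0, 1} codes."""
    return (real_valued_codes > 0).astype(np.uint8)

def hamming_rank(query_code, doc_codes):
    """Return document indices ordered by Hamming distance to the query."""
    distances = np.count_nonzero(doc_codes != query_code, axis=1)
    return np.argsort(distances)

rng = np.random.default_rng(0)
docs = binarize(rng.normal(size=(1000, 128)))   # stand-ins for learned embeddings
query = binarize(rng.normal(size=(1, 128)))
print(hamming_rank(query, docs)[:5])            # five nearest documents
```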
Based on Raman spectroscopy, this paper investigates the application of one-sided classification algorithms to the task of separating hazardous chlorinated solvents from other materials. Experiments were performed using a new one-sided classification toolkit designed and developed from scratch. In the one-sided classification paradigm, the objective is to separate elements of the target class from all outliers. In practice, these one-sided classifiers are usually chosen when there is some deficiency in the training sample: sometimes outlier examples are rare, expensive to label, or entirely absent. However, the author seeks to show that they are equally applicable when outlier examples are plentiful but not statistically representative of the complete outlier concept. It is exactly this situation that is addressed in the present work. Under these circumstances, one-sided classifiers have been found to be more robust than conventional multi-class classifiers. The term "unexpected" outliers is introduced to denote outlier examples encountered in the test set that have been drawn from a different distribution than the training sample. Such examples result from an inadequate representation of all possible outliers in the training set. Given that outliers can stand for any amount of "everything else" that is not the target, it is generally impossible to characterize the outlier examples completely. The results of this study demonstrate the potential drawbacks of using conventional multi-class classification algorithms when the test data come from a completely different distribution than the training sample.
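The study above uses its own purpose-built toolkit; purely as a generic illustration of the one-sided paradigm (fit on target-class examples only, reject anything unlike them), a minimal sketch with scikit-learn's OneClassSVM and random stand-in spectra could look like this:

```python
# Generic illustration of one-sided (one-class) classification: the model is
# fit on target-class spectra only and flags anything unlike them as an outlier.
# This uses scikit-learn's OneClassSVM, not the custom toolkit developed for the
# study; the "spectra" below are random stand-ins, not real Raman measurements.
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
target_spectra = rng.normal(loc=0.0, scale=1.0, size=(200, 50))      # target class only
unexpected_outliers = rng.normal(loc=4.0, scale=1.0, size=(20, 50))  # unseen distribution

model = OneClassSVM(kernel="rbf", gamma="scale", nu=0.05)
model.fit(target_spectra)                        # no outlier labels required

print(model.predict(target_spectra[:5]))         # mostly +1 (accepted as target)
print(model.predict(unexpected_outliers[:5]))    # mostly -1 (rejected as outliers)
```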