Twitter is a new web application playing the dual roles of online social networking and microblogging. Users communicate with each other by publishing text-based posts. The popularity and open structure of Twitter have attracted a large number of automated programs, known as bots, which appear to be a double-edged sword for Twitter. Legitimate bots generate a large number of benign tweets delivering news and updating feeds, while malicious bots spread spam or malicious content. More interestingly, in the middle ground between human and bot, there has emerged the cyborg, referring to either a bot-assisted human or a human-assisted bot. To assist human users in identifying who they are interacting with, this paper focuses on the classification of human, bot, and cyborg accounts on Twitter. We first conduct a set of large-scale measurements with a collection of over 500,000 accounts. We observe differences among humans, bots, and cyborgs in terms of tweeting behavior, tweet content, and account properties. Based on the measurement results, we propose a classification system that includes the following four parts: 1) an entropy-based component, 2) a spam detection component, 3) an account properties component, and 4) a decision maker. It uses a combination of features extracted from an unknown user to determine the likelihood of that user being a human, bot, or cyborg. Our experimental evaluation demonstrates the efficacy of the proposed classification system.
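As a hedged illustration of what an entropy-based timing component like the one described above might compute, the sketch below scores the regularity of an account's inter-tweet intervals: near-periodic, bot-like posting yields low entropy, while irregular human posting yields higher values. The `interval_entropy` helper and the bin count are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of an entropy-based timing feature (assumed helper, not the paper's code).
import math
from collections import Counter

def interval_entropy(timestamps, n_bins=20):
    """Shannon entropy (in bits) of binned inter-tweet intervals.
    Highly regular, bot-like posting yields low entropy; human posting tends to yield higher values."""
    intervals = [t2 - t1 for t1, t2 in zip(timestamps, timestamps[1:])]
    if not intervals:
        return 0.0
    lo, hi = min(intervals), max(intervals)
    width = (hi - lo) / n_bins or 1.0
    bins = Counter(min(int((x - lo) / width), n_bins - 1) for x in intervals)
    total = len(intervals)
    return -sum((c / total) * math.log2(c / total) for c in bins.values())

# Example: an account tweeting exactly every 600 seconds scores ~0 bits.
bot_times = [i * 600 for i in range(50)]
print(interval_entropy(bot_times))
```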
Humans are inherently social. Throughout history, people have formed communities and built relationships. Most relationships with coworkers, friends, and family are developed through face-to-face interaction. These relationships are established through explicit means of communication, such as words, and implicit ones, such as tone of voice and body language. By analyzing interpersonal communication, we can derive information about the relationships among, and the influence of, conversation participants. However, with the growth of the Internet, people have begun to communicate through text in online social networks. Interestingly, they have carried their communicative habits over to the Internet. Many social network users form relationships with one another and build communities with leaders and followers. Recognizing these hierarchical relationships is an important task, because it can help us understand social networks and predict future trends, improve recommendations, better target advertising, and improve national security by identifying the leaders of anonymous terrorist organizations. In this work, I outline current research in this area and present state-of-the-art approaches to the problem of identifying hierarchical relationships in social networks.
Preface
The performance of an Artificial Intelligence system often depends on the amount of world knowledge available to it. During the last decade, the AI community has witnessed the emergence of a number of highly structured knowledge repositories whose collaborative nature has led to a dramatic increase in the amount of world knowledge that can now be exploited in AI applications. Arguably, the best-known repository of user-contributed knowledge is Wikipedia. Since its inception less than eight years ago, it has become one of the largest and fastest growing on-line sources of encyclopedic knowledge. One of the reasons why Wikipedia is appealing to contributors and users alike is the richness of its embedded structural information: articles are hyperlinked to each other and connected to categories from an ever expanding taxonomy; pervasive language phenomena such as synonymy and polysemy are addressed through redirection and disambiguation pages; entities of the same type are described in a consistent format using infoboxes; related articles are grouped together in series templates. Many more repositories of user-contributed knowledge exist besides Wikipedia. Collaborative tagging in Delicious and community-driven question answering in Yahoo! Answers and Wiki Answers are only a few examples of knowledge sources that, like Wikipedia, can become a valuable asset for AI researchers. Furthermore, AI methods have the potential to improve these resources, as demonstrated recently by research on personalized tag recommendations, or on matching user questions with previously answered questions. The goal of this workshop was to foster the research and dissemination of ideas on the mutually beneficial interaction between AI and repositories of user-contributed knowledge. This volume contains the papers accepted for presentation at the workshop. We issued calls for regular papers, short late-breaking papers, and demos. After careful review by the program committee of the 20 submissions received (13 regular papers, 6 short papers, and 1 demo), 5 regular papers and 3 short papers were accepted for presentation. Consistent with the original aim of the workshop, the accepted papers address a diverse set of problems and resources, although Wikipedia-based systems are still dominant. The accepted papers explore leveraging knowledge induced and patterns learned from Wikipedia, applying them to the web or to untagged text collections and using such knowledge for tasks such as information extraction, entity disambiguation, terminology extraction, and analysing the structure of social networks. We also learn of useful methods that integrate Wikipedia with structured resources, in particular relational databases. The members of the program committee provided high quality reviews in a timely fashion, and all submissions have benefited from this expert feedback. For a successful event, having high quality invited speakers is crucial. We were lucky to have two excellent speakers for the workshop.
Little research exists on one of the most common, oldest, and most utilized forms of online social geographic information: the "location" field found in most virtual community user profiles. We performed the first in-depth study of user behavior with regard to the location field in Twitter user profiles. We found that 34% of users did not provide real location information, frequently entering fake locations or sarcastic comments that can fool traditional geographic information tools. When users did enter their location, they almost never specified it at a scale more detailed than their city. To determine whether these natural user behaviors have a real effect on the "locatability" of users, we performed a simple machine learning experiment to see whether a user's location can be identified from nothing more than what that user tweets. We found that a user's country and state can in fact be determined with decent accuracy, indicating that users implicitly reveal location information, whether or not they realize it. Implications for location-based services and privacy are discussed.
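The "simple machine learning experiment" mentioned above is not detailed here; the following hedged sketch shows one plausible setup, a bag-of-words text classifier trained on tweets labeled with the poster's state. The tiny inline dataset and the choice of Naive Bayes are illustrative assumptions, not the paper's exact configuration.

```python
# Sketch: predicting a user's region from tweet text alone (toy data, assumed model choice).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

tweets = [
    "stuck on the 405 again, classic LA traffic",
    "beach day in santa monica",
    "bagels and the subway, gotta love nyc mornings",
    "watching the knicks at msg tonight",
]
states = ["CA", "CA", "NY", "NY"]

model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(tweets, states)
print(model.predict(["surfing before work, traffic on the 101"]))  # likely 'CA'
```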
Search engines have become the de facto place to start information acquisition on the Web. However, due to the web spam phenomenon, search results are not always as good as desired. Moreover, spam evolves, which makes the problem of providing high-quality search even more challenging. Over the last decade, research on adversarial information retrieval has gained a lot of interest from both academia and industry. In this paper we present a systematic review of web spam detection techniques with a focus on algorithms and underlying principles. We categorize all existing algorithms into three categories based on the type of information they use: content-based methods, link-based methods, and methods based on non-traditional data such as user behaviour, clicks, and HTTP sessions. In turn, we subcategorize the link-based category into five groups based on the ideas and principles used: label propagation, link pruning and reweighting, label refinement, graph regularization, and feature-based methods. We also define the concept of web spam numerically and provide a brief survey of various spam forms. Finally, we summarize the observations and underlying principles applied for web spam detection.
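To make the "label propagation" subgroup of link-based methods concrete, here is a hedged, TrustRank-style sketch: trust mass is seeded on known-good pages and propagated along out-links, so pages reachable only from untrusted regions end up with low trust. The toy graph, seed set, and damping factor are illustrative assumptions, not taken from the survey.

```python
# Sketch of trust/label propagation over a web link graph (illustrative, not from the survey).
def propagate_trust(out_links, seeds, damping=0.85, iters=20):
    nodes = list(out_links)
    trust = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    seed_mass = dict(trust)
    for _ in range(iters):
        new = {n: (1 - damping) * seed_mass[n] for n in nodes}
        for n, outs in out_links.items():
            if outs:
                share = damping * trust[n] / len(outs)
                for m in outs:
                    new[m] = new.get(m, 0.0) + share
        trust = new
    return trust  # pages reachable only from non-seed regions keep near-zero trust

web = {"good": ["a"], "a": ["b"], "b": [], "spam": ["spam2"], "spam2": ["spam"]}
print(propagate_trust(web, seeds={"good"}))
```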
In this paper, we continue our investigations of "web spam": the injection of artificially created pages into the web in order to influence the results from search engines and to drive traffic to certain pages for fun or profit. This paper considers some previously undescribed techniques for automatically detecting spam pages and examines the effectiveness of these techniques both in isolation and when aggregated using classification algorithms. When combined, our heuristics correctly identify 2,037 (86.2%) of the 2,364 spam pages (13.8% of our judged collection of 17,168 pages), while misidentifying 526 spam and non-spam pages (3.1%).
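As a hedged sketch of what "aggregating heuristics using classification algorithms" can look like, the snippet below feeds a few per-page heuristic features into a small decision tree. The feature names, toy data, and model choice are illustrative assumptions rather than the paper's actual heuristics.

```python
# Sketch: combining per-page heuristic features with a classifier (illustrative features and data).
from sklearn.tree import DecisionTreeClassifier

# Each row: [num_words, frac_visible_content, avg_word_length, frac_anchor_text]
pages = [
    [1200, 0.65, 4.8, 0.05],   # typical page
    [9000, 0.98, 3.1, 0.60],   # keyword-stuffed, link-heavy page
    [800,  0.70, 5.0, 0.04],
    [7000, 0.95, 3.0, 0.55],
]
labels = [0, 1, 0, 1]          # 0 = non-spam, 1 = spam

clf = DecisionTreeClassifier(max_depth=3).fit(pages, labels)
print(clf.predict([[8500, 0.97, 3.2, 0.58]]))  # likely flagged as spam
```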
The Internet has become a rich and large repository of information about us as individuals. Anything from the links and text on a user's homepage to the mailing lists the user subscribes to is a reflection of the social interactions the user has in the real world. In this paper we devise techniques and tools to mine this information in order to extract social networks and the exogenous factors underlying the networks' structure. In an analysis of two data sets, from Stanford University and the Massachusetts Institute of Technology (MIT), we show that some factors are better indicators of social connections than others, and that these indicators vary between user populations. Our techniques provide potential applications in automatically inferring real-world connections and in discovering, labeling, and characterizing communities.
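One simple signal such mining might rely on, shown here only as a hedged sketch, is co-membership: users who appear together on many of the same mailing lists or pages are more likely to be connected. The data, threshold, and helper code are illustrative assumptions, not the paper's method.

```python
# Sketch: inferring likely social ties from shared mailing-list membership (toy data).
from itertools import combinations
from collections import Counter

memberships = {
    "lug-list": {"alice", "bob", "carol"},
    "robotics-list": {"alice", "bob"},
    "choir-list": {"carol", "dave"},
}

co_counts = Counter()
for members in memberships.values():
    for u, v in combinations(sorted(members), 2):
        co_counts[(u, v)] += 1

# Infer an edge when two users share at least 2 lists (arbitrary illustrative threshold).
edges = [pair for pair, c in co_counts.items() if c >= 2]
print(edges)  # [('alice', 'bob')]
```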
Detecting anomalies in data is a vital task, with numerous high-impact applications in areas such as security, finance, health care, and law enforcement. While numerous techniques have been developed in past years for spotting outliers and anomalies in unstructured collections of multi-dimensional points, with graph data becoming ubiquitous, techniques for structured graph data have been of focus recently. As objects in graphs have long-range correlations, a suite of novel technology has been developed for anomaly detection in graph data. This survey aims to provide a general, comprehensive, and structured overview of the state-of-the-art methods for anomaly detection in data represented as graphs. As a key contribution, we provide a comprehensive exploration of both data mining and machine learning algorithms for these detection tasks. We give a general framework for the algorithms categorized under various settings: unsupervised vs. (semi-)supervised approaches, for static vs. dynamic graphs, for attributed vs. plain graphs. We highlight the effectiveness, scalability, generality, and robustness aspects of the methods. What is more, we stress the importance of anomaly attribution and highlight the major techniques that facilitate digging out the root cause, or the 'why', of the detected anomalies for further analysis and sense-making. Finally, we present several real-world applications of graph-based anomaly detection in diverse domains, including financial, auction, computer traffic, and social networks. We conclude our survey with a discussion on open theoretical and practical challenges in the field.
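As a minimal, hedged illustration of the unsupervised, static, plain-graph setting that the survey categorizes, the sketch below scores each node by how far its degree deviates from the typical degree (a robust z-score). The methods covered by the survey are far richer; this only conveys the flavor of outlier scoring on graph data.

```python
# Sketch: unsupervised degree-based outlier scores on a plain static graph (toy edges).
import statistics

def degree_outlier_scores(edges):
    deg = {}
    for u, v in edges:
        deg[u] = deg.get(u, 0) + 1
        deg[v] = deg.get(v, 0) + 1
    med = statistics.median(deg.values())
    mad = statistics.median(abs(d - med) for d in deg.values()) or 1.0
    return {n: abs(d - med) / mad for n, d in deg.items()}

edges = [("a", "b"), ("b", "c"), ("c", "a"), ("hub", "a"), ("hub", "b"),
         ("hub", "c"), ("hub", "d"), ("hub", "e"), ("hub", "f")]
print(degree_outlier_scores(edges))  # 'hub' receives by far the highest score
```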
In recent years, the reliability of information on the Internet has emerged as a crucial issue of modern society. Social network sites (SNSs) have revolutionized the way in which information is spread by allowing users to freely share content. As a consequence, SNSs are also increasingly used as vectors for the diffusion of misinformation and hoaxes. The amount of disseminated information and the rapidity of its diffusion make it practically impossible to assess reliability in a timely manner, highlighting the need for automatic hoax detection systems. As a contribution towards this objective, we show that Facebook posts can be classified with high accuracy as hoaxes or non-hoaxes on the basis of the users who "liked" them. We present two classification techniques, one based on logistic regression, the other on a novel adaptation of boolean crowdsourcing algorithms. On a dataset consisting of 15,500 Facebook posts and 909,236 users, we obtain classification accuracies exceeding 99% even when the training set contains less than 1% of the posts. We further show that our techniques are robust: they work even when we restrict our attention to the users who like both hoax and non-hoax posts. These results suggest that mapping the diffusion pattern of information can be a useful component of automatic hoax detection systems.
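A hedged sketch of the logistic-regression variant described above: each post is represented as a binary vector over users ("did user u like this post?") and a classifier is trained on labeled hoax/non-hoax posts. The tiny like matrix below is an illustrative assumption, not the paper's dataset.

```python
# Sketch: classifying posts as hoax/non-hoax from the users who liked them (toy like matrix).
import numpy as np
from sklearn.linear_model import LogisticRegression

#                 u0 u1 u2 u3 u4
likes = np.array([
    [1, 1, 0, 0, 0],   # hoax post, liked mostly by u0 and u1
    [1, 1, 1, 0, 0],   # hoax
    [0, 0, 0, 1, 1],   # non-hoax post, liked mostly by u3 and u4
    [0, 0, 1, 1, 1],   # non-hoax
])
is_hoax = np.array([1, 1, 0, 0])

clf = LogisticRegression().fit(likes, is_hoax)
print(clf.predict([[1, 1, 0, 0, 1]]))  # likely classified as hoax
```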
The unprecedented availability of social media data offers substantial opportunities for data owners, system operators, solution providers, and end users to explore and understand social dynamics. However, the exponential growth in the volume, velocity, and variability of social media data prevents people from fully utilizing such data. Visual analytics, which is an emerging research direction, has received considerable attention in recent years. Many visual analytics methods have been proposed across disciplines to understand large-scale structured and unstructured social media data. This objective, however, also poses significant challenges for researchers to obtain a comprehensive picture of the area, understand research challenges, and develop new techniques. In this paper, we present a comprehensive survey to characterize this fast-growing area and summarize the state-of-the-art techniques for analyzing social media data. In particular, we classify existing techniques into two categories: gathering information and understanding user behaviors. We aim to provide a clear overview of the research area through the established taxonomy. We then explore the design space and identify the research trends. Finally, we discuss challenges and open questions for future studies.
Doxing is a form of online abuse in which a malicious party harms another by releasing identifying or sensitive information. Motivations for doxing include personal, competitive, and political reasons, and web users of all ages, genders, and levels of internet experience have been targeted. Existing research on doxing is primarily qualitative. This work improves our understanding of doxing by being the first to take a quantitative approach. We do so by designing and deploying a tool that can detect dox files and measure the frequency, content, targets, and effects of doxing on popular dox-posting sites. This work analyzes over 1.7 million text files posted to pastebin.com, 4chan.org, and 8ch.net, sites frequently used to share doxes online, over a combined period of approximately thirteen weeks. Notable findings in this work include that approximately 0.3% of shared files are doxes, that online social networking accounts mentioned in these dox files are more likely to close than typical accounts, that justice and revenge are the most often cited motivations for doxing, and that dox files target males more frequently than females. We also find that recent anti-abuse efforts by social networks have reduced how frequently doxing victims closed or restricted their accounts after being attacked. We also propose mitigation steps, such as a service that can inform people when their accounts have been shared in a dox file, or law enforcement notification tools to inform authorities when individuals are at heightened risk of abuse.
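The paper's dox-detection tool is not specified here; as a hedged sketch of one naive approach, the snippet below flags a text file as a possible dox when it contains several personally identifying patterns. The regexes and threshold are illustrative assumptions only, not the authors' detector.

```python
# Sketch: a naive dox-file heuristic based on density of PII-like patterns (illustrative only).
import re

PATTERNS = {
    "phone": r"\b\d{3}[-. ]\d{3}[-. ]\d{4}\b",
    "email": r"\b[\w.+-]+@[\w-]+\.[\w.]+\b",
    "keyword": r"\b(?:full name|address|dob|ssn|mother'?s maiden name)\b",
}

def looks_like_dox(text, min_hits=3):
    hits = sum(len(re.findall(p, text, flags=re.IGNORECASE)) for p in PATTERNS.values())
    return hits >= min_hits

sample = "Full Name: J. Doe\nAddress: 123 Main St\nPhone: 555-867-5309\nemail: jdoe@example.com"
print(looks_like_dox(sample))  # True
```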
Malicious domains are one of the major resources required for adversaries to run attacks over the Internet. Due to the important role of the Domain Name System (DNS), extensive research has been conducted to identify malicious domains based on their unique behavior reflected in different phases of the life cycle of DNS queries and responses. Existing approaches differ significantly in terms of intuitions, data analysis methods as well as evaluation methodologies. This warrants a thorough systematization of the approaches and a careful review of the advantages and limitations of every group. In this paper, we perform such an analysis. In order to achieve this goal, we present the necessary background knowledge on DNS and malicious activities leveraging DNS. We describe a general framework of malicious domain detection techniques using DNS data. Applying this framework, we categorize existing approaches using several orthogonal viewpoints, namely (1) sources of DNS data and their enrichment, (2) data analysis methods, and (3) evaluation strategies and metrics. In each aspect, we discuss the important challenges that the research community should address in order to fully realize the power of DNS data analysis to fight against attacks leveraging malicious domains.
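As a hedged example of one lexical feature commonly used by the detection approaches such a survey covers, the sketch below computes the character entropy of a domain label; algorithmically generated malicious domains tend to score higher than ordinary ones. The feature choice is an illustrative assumption, not a recommendation drawn from the paper.

```python
# Sketch: character entropy of a domain label as a simple lexical DNS feature (illustrative).
import math
from collections import Counter

def name_entropy(domain):
    label = domain.split(".")[0]
    counts = Counter(label)
    n = len(label)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

print(name_entropy("google.com"))          # relatively low
print(name_entropy("xj4k9qzt2p7vbn.com"))  # higher, more random-looking
```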
The World Wide Web has grown to be a primary source of information for millions of people. Due to the size of the Web, search engines have become the major access point for this information. However, "commercial" search engines use hidden algorithms that put the integrity of their results in doubt, collect user data that raises privacy concerns, and target the general public, thus failing to serve the needs of specific search users. Open source search, like open source operating systems, offers alternatives. The goal of the Open Source Information Retrieval Workshop (OSIR) is to bring together practitioners developing open source search technologies in the context of a premier IR research conference to share their recent advances and to coordinate their strategy and research plans. The intent is to foster community-based development, to promote distribution of transparent Web search tools, and to strengthen the interaction with the research community in IR. A workshop about Open Source Web Information Retrieval was held last year in Compiègne, France as part of WI 2005. The focus of this workshop has been broadened to the whole open source information retrieval community. We want to thank all the authors of the submitted papers, the members of the program committee, and the several reviewers whose contributions have resulted in these high quality proceedings.
ABSTRACT There has been a resurgence of interest in index maintenance (or incremental indexing) in the academic community in the last three years. Most of this work focuses on how to build indexes as quickly as possible, given the need to run queries during the build process. This work is based on a different set of assumptions than previous work. First, we focus on latency instead of throughput: we aim to reduce both index latency (the amount of time between when a new document is available to be indexed and when it is available to be queried) and query latency (the amount of time that an incoming query must wait because of index processing). Additionally, we assume that users are unwilling to tune parameters to make the system more efficient. We show how this set of assumptions has driven the development of the Indri index maintenance strategy, and describe the details of our implementation.
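To illustrate the general incremental-indexing idea discussed in the abstract, here is a hedged sketch of an inverted index with a small in-memory buffer that is queryable immediately (low index latency) and merged into the main index later. This is not Indri's actual maintenance strategy, only a toy version of the concept.

```python
# Sketch: an incrementally maintained inverted index with an immediately queryable in-memory buffer.
from collections import defaultdict

class IncrementalIndex:
    def __init__(self):
        self.main = defaultdict(set)      # term -> doc ids (merged portion)
        self.memory = defaultdict(set)    # term -> doc ids (recent, not yet merged)

    def add(self, doc_id, text):
        for term in text.lower().split():
            self.memory[term].add(doc_id)  # available to queries right away

    def merge(self):
        for term, docs in self.memory.items():
            self.main[term] |= docs
        self.memory.clear()

    def query(self, term):
        term = term.lower()
        return self.main[term] | self.memory[term]

idx = IncrementalIndex()
idx.add(1, "open source search")
print(idx.query("search"))  # {1} is returned before any merge happens
```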
Blogs, often treated as the equivalent of online personal diaries, have become one of the fastest growing types of Web-based media. Everyone is free to express their opinions and emotions very easily through blogs. In the blogosphere, many communities have emerged, including hate groups and racists that are trying to share their ideology, express their views, or recruit new group members. It is important to analyze these virtual communities, defined based on membership and subscription linkages, in order to monitor for activities that are potentially harmful to society. While many Web mining and network analysis techniques have been used to analyze the content and structure of the Web sites of hate groups on the Internet, these techniques have not been applied to the study of hate groups in blogs. To address this issue, we propose a semi-automated approach in this research. The proposed approach consists of four modules, namely blog spider, information extraction, network analysis, and visualization. We applied this approach to identify and analyze a selected set of 28 anti-Black hate groups (820 bloggers) on Xanga, one of the most popular blog hosting sites. Our analysis revealed some interesting demographic and topological characteristics of these groups, and identified at least two large communities on top of the smaller ones. The study also demonstrated the feasibility of applying the proposed approach to the study of hate groups and other related communities in blogs.
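As a hedged sketch of the kind of network-analysis step the proposed approach includes, the snippet below builds a graph from subscription linkages, extracts connected components as candidate communities, and ranks bloggers by degree centrality. The edge list is an illustrative assumption, not data from the study.

```python
# Sketch: community and centrality analysis over a subscription-link graph (toy edges).
import networkx as nx

subscriptions = [("blogger1", "blogger2"), ("blogger2", "blogger3"),
                 ("blogger3", "blogger1"), ("blogger4", "blogger5")]

g = nx.Graph(subscriptions)
communities = list(nx.connected_components(g))   # candidate communities
centrality = nx.degree_centrality(g)             # simple influence proxy

print(len(communities))                      # 2 communities in this toy graph
print(max(centrality, key=centrality.get))   # most central blogger
```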
The proliferation and rapid diffusion of fake news on the Internet highlight the need for automatic hoax detection systems. In the context of social networks, machine learning (ML) methods can be used for this purpose. Fake news detection strategies are traditionally based either on content analysis (i.e., analyzing the content of the news) or, more recently, on social context models, such as mapping the news' diffusion pattern. In this paper, we first propose a novel ML fake news detection method which, by combining news content and social context features, outperforms existing methods in the literature, increasing their already high accuracy by up to 4.8%. Second, we implement our method within a Facebook Messenger chatbot and validate it with a real-world application, obtaining a fake news detection accuracy of 81.7%.
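A hedged sketch of the general idea of combining content features with social-context features in a single model follows; the specific features, toy data, and classifier below are illustrative assumptions, not the paper's pipeline.

```python
# Sketch: concatenating text features with social-context features for one classifier (toy data).
from scipy.sparse import hstack, csr_matrix
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

texts = ["miracle cure doctors do not want you to know about",
         "city council approves new budget",
         "aliens built the pyramids, insiders admit",
         "local team wins weekend derby"]
# Hypothetical social-context features per post: [likes, shares, fraction of likes from hoax-prone users]
context = [[900, 450, 0.80], [120, 10, 0.05], [1500, 800, 0.70], [300, 40, 0.10]]
labels = [1, 0, 1, 0]  # 1 = fake, 0 = real

vec = TfidfVectorizer()
X = hstack([vec.fit_transform(texts), csr_matrix(context)])
clf = LogisticRegression().fit(X, labels)

new_text = ["shocking miracle cure they are hiding from you"]
new_context = [[1100, 600, 0.75]]
print(clf.predict(hstack([vec.transform(new_text), csr_matrix(new_context)])))  # likely 1
```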
This paper describes, develops, and validates SciLens, a method for evaluating the quality of scientific news articles. The starting point of our work is a structured methodology that defines a set of quality aspects for manually evaluating news. Based on these aspects, we describe a series of indicators of news quality. According to our experiments, these indicators help non-experts evaluate the quality of a scientific news article more accurately than non-experts who do not have access to these indicators. Furthermore, SciLens can also be used to produce a fully automated quality score for an article, which agrees with expert evaluators more than the assessments of non-expert evaluators do. One of the main elements of SciLens is its focus on both the content and the context of an article, where the context is provided by (1) the article's explicit and implicit references to the scientific literature, and (2) reactions on social media referencing the article. We show that both contextual elements can be valuable sources of information for determining article quality. The validation of SciLens, conducted through a combination of expert and non-expert annotation, demonstrates its effectiveness for both semi-automatic and automatic quality evaluation of scientific news.
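As a hedged illustration of one contextual signal in the spirit of SciLens, the sketch below counts an article's outbound links to scientific sources. The domain list and the scoring are illustrative assumptions, not SciLens's actual indicators.

```python
# Sketch: counting explicit references to scientific sources as one quality indicator (illustrative).
import re

SCIENTIFIC_DOMAINS = ("nature.com", "sciencemag.org", "nih.gov", "arxiv.org", "doi.org")

def scientific_reference_count(article_html):
    links = re.findall(r'href="([^"]+)"', article_html)
    return sum(any(dom in url for dom in SCIENTIFIC_DOMAINS) for url in links)

html = '<a href="https://doi.org/10.1000/xyz">study</a> <a href="https://example.com/blog">blog</a>'
print(scientific_reference_count(html))  # 1
```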
The explosive growth of fake news and its erosion of democracy, justice, and public trust have increased the demand for fake news analysis, detection, and intervention. This survey comprehensively and systematically reviews fake news research. It identifies and specifies fundamental theories across various disciplines, such as psychology and the social sciences, to facilitate interdisciplinary research on fake news. Current fake news studies are reviewed, summarized, and evaluated. These studies examine fake news from four perspectives: (1) the false knowledge it carries, (2) its writing style, (3) its propagation patterns, and (4) the credibility of its creators and spreaders. We characterize each perspective with the various kinds of analyzable and available information provided by the news and its spreaders, the strategies and frameworks that can be adapted, and the techniques that are applicable. By reviewing the characteristics of fake news and open issues in fake news research, we highlight several potential research tasks at the end of this survey.