Opinion spam has become a widespread problem in the online review world, where paid or biased reviewers write fake reviews to elevate or relegate a product (or business), misleading consumers for profit or fame. In recent years, opinion spam detection has attracted a lot of attention from both the business and research communities. However, the problem remains challenging: human labeling is expensive, so the labeled data needed for supervised learning and evaluation is scarce. Recent works (e.g., FraudEagle, SpEagle) address the spam detection problem as an unsupervised network inference task on the review network. These methods can also incorporate labels (if available), and have been shown to achieve improved performance in the semi-supervised inference setting, in which the labels of a random sample of nodes are consumed. In this work, we address the problem of active inference for opinion spam detection. Active inference is the process of carefully selecting a subset of instances (nodes) whose labels are obtained from an oracle and used during the (network) inference. Our goal is to employ a label acquisition strategy that selects a given number of nodes (a.k.a. the budget) wisely, as opposed to randomly, so as to improve detection performance significantly over random selection. Our key insight is to select nodes that (i) exhibit high uncertainty, (ii) reside in a dense region, and (iii) are close to other uncertain nodes in the network. Based on this insight, we design a utility measure, called Expected UnCertainty Reach (EUCR), and iteratively pick the node with the highest EUCR score at every step. Experiments on two large real-world datasets from Yelp.com show that our method significantly outperforms random sampling as well as other state-of-the-art active inference approaches.
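The greedy selection loop described above can be illustrated with a minimal sketch. Note that the exact EUCR formula is defined in the paper, not here; the `score` function below is a hypothetical stand-in that merely combines the three ingredients named in the insight (own uncertainty, local density via degree, and neighbors' uncertainty), and all names (`adj`, `beliefs`, `oracle`) are assumptions for illustration.

```python
import math

def uncertainty(p):
    """Binary entropy of a spam belief p in [0, 1]; maximal at p = 0.5."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def score(adj, beliefs, node):
    """Hypothetical EUCR-like utility: the node's own uncertainty,
    boosted by local density (degree) and by the summed uncertainty
    of its immediate neighbors. Not the paper's actual measure."""
    own = uncertainty(beliefs[node])
    neigh = sum(uncertainty(beliefs[v]) for v in adj[node])
    return own * (1 + len(adj[node])) * (1 + neigh)

def select_labels(adj, beliefs, budget, oracle):
    """Greedy label acquisition: pick the highest-scoring unlabeled
    node, query the oracle, clamp its belief, and repeat up to the
    budget. (The real method would re-run network inference after
    each acquisition to propagate the new label.)"""
    labeled = {}
    for _ in range(budget):
        candidates = [v for v in adj if v not in labeled]
        best = max(candidates, key=lambda v: score(adj, beliefs, v))
        labeled[best] = oracle(best)
        beliefs[best] = float(labeled[best])  # clamp belief to the label
    return labeled
```

The point of the sketch is the structure of active inference, not the scoring detail: with a budget of *b*, the oracle is consulted *b* times, each time at the node the current utility deems most informative rather than at a random node.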