Crowdsourcing enables one to leverage the intelligence and wisdom of potentially large groups of individuals toward solving problems. Common problems approached with crowdsourcing are labeling images, translating or transcribing text, providing opinions or ideas, and similar tasks that computers are not good at or where they may even fail altogether. The introduction of humans into computations and/or everyday work, however, also poses critical, novel challenges in terms of quality control, as the crowd is typically composed of people with unknown and very diverse abilities, skills, interests, personal objectives, and technological resources. This survey studies quality in the context of crowdsourcing along several dimensions, so as to define and characterize it and to understand the current state of the art. Specifically, the survey derives a quality model for crowdsourcing tasks, identifies the methods and techniques that can be used to assess the attributes of the model, and describes the actions and strategies that help prevent and mitigate quality problems. An analysis of how these features are supported by the state of the art further identifies open issues and informs an outlook on promising future research directions.
Crowdsourcing platforms are a popular choice for researchers to gather text annotations quickly at scale. We investigate whether crowdsourced annotations are useful when the labeling task requires medical domain knowledge. Comparing a sentence classification model trained with expert-annotated sentences to the same model trained on crowd-labeled sentences, we find the crowdsourced training data to be just as effective as the manually produced dataset. We can improve the accuracy of the crowd-fueled model without collecting further labels by filtering out worker labels applied with low confidence.
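As an illustration of the filtering step mentioned above, the sketch below drops crowd labels whose confidence falls under a threshold before training; the record fields, threshold value, and example sentences are hypothetical and not taken from the study.

```python
# Minimal sketch: drop crowd labels collected with low confidence before training.
# Assumes each record carries a worker-reported confidence in [0, 1]; field names
# and the 0.7 threshold are illustrative only.
def filter_low_confidence(records, threshold=0.7):
    """Keep only (sentence, label) pairs whose confidence meets the threshold."""
    return [(r["sentence"], r["label"]) for r in records if r["confidence"] >= threshold]

crowd_records = [
    {"sentence": "Patient denies chest pain.", "label": "negative_finding", "confidence": 0.9},
    {"sentence": "Possible infiltrate noted.", "label": "positive_finding", "confidence": 0.4},
]
training_data = filter_low_confidence(crowd_records)
print(training_data)  # only the high-confidence example survives
```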
A universally valid ground truth is nearly impossible to obtain, or can only be obtained at very high cost. For supervised learning without a universally valid ground truth, the recommended approach is crowdsourcing: collect a large dataset annotated by multiple people of potentially different levels of expertise and infer the ground truth to be used as labels for training a classifier. However, because of the sensitivity of the problem at hand (e.g., mitosis detection in breast cancer histology images), the data obtained this way needs to be validated and properly evaluated before being used for classifier training. Even in the context of organic computing systems, an indisputable ground truth does not necessarily exist; it should therefore be inferred through the aggregation and validation of the local knowledge of each autonomous entity.
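A common baseline for the ground-truth inference described here is majority voting over the annotators of each item; the sketch below is a minimal illustration with made-up patch identifiers and labels, not the aggregation method of any particular paper.

```python
from collections import Counter

# Minimal sketch of ground-truth inference by majority vote over multiple annotators.
def majority_vote(annotations):
    """annotations: dict mapping item_id -> list of labels from different annotators."""
    return {item: Counter(labels).most_common(1)[0][0] for item, labels in annotations.items()}

votes = {
    "patch_001": ["mitosis", "mitosis", "no_mitosis"],
    "patch_002": ["no_mitosis", "no_mitosis", "no_mitosis"],
}
print(majority_vote(votes))  # {'patch_001': 'mitosis', 'patch_002': 'no_mitosis'}
```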
We present an extensive three-year study on economically annotating video with crowdsourced marketplaces. Our public framework has annotated thousands of real-world videos, including massive data sets unprecedented for their size, complexity, and cost. To accomplish this, we designed a state-of-the-art video annotation user interface and demonstrate that, despite common intuition, many contemporary interfaces are sub-optimal. We present several user studies that evaluate different aspects of our system and demonstrate that minimizing the cognitive load of the user is crucial when designing an annotation platform. We then deploy this interface on Amazon Mechanical Turk and discover expert and talented workers who are capable of annotating difficult videos with dense and closely cropped labels. We argue that video annotation requires specialized skill; most workers are poor annotators, mandating robust quality control protocols. We show that traditional crowdsourced micro-tasks are not suitable for video annotation and instead demonstrate that deploying time-consuming macro-tasks on MTurk is effective. Finally, we show that by extracting pixel-based features from manually labeled key frames, we are able to leverage more sophisticated interpolation strategies to maximize performance given a fixed budget. We validate the power of our framework on difficult, real-world data sets and we demonstrate an inherent trade-off between the mix of human and cloud computing used vs. the accuracy and cost of the labeling. We further introduce a novel, cost-based evaluation criterion that compares vision algorithms by the budget required to achieve an acceptable performance. We hope our findings will spur innovation in the creation of massive labeled video data sets and enable novel data-driven computer vision applications. (A preliminary version of this work appeared in ECCV 2010 by Vondrick et al.)
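The interpolation idea referred to above can be illustrated with its simplest variant, linear interpolation of bounding boxes between two labeled key frames; the box format and frame numbers in the sketch are illustrative only.

```python
# Minimal sketch: linearly interpolate a bounding box between two manually labeled
# key frames, the simplest form of the interpolation strategy discussed above.
def interpolate_boxes(frame_a, box_a, frame_b, box_b, frame_t):
    """Boxes are (x, y, w, h); frame_t lies between key frames frame_a and frame_b."""
    alpha = (frame_t - frame_a) / (frame_b - frame_a)
    return tuple(a + alpha * (b - a) for a, b in zip(box_a, box_b))

# Key frames at frame 0 and frame 10; query the box at frame 5.
print(interpolate_boxes(0, (10, 20, 50, 80), 10, (30, 24, 50, 80), 5))
```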
We introduce tools and methodologies to collect high quality, large scale fine-grained computer vision datasets using citizen scientists: crowd annotators who are passionate and knowledgeable about specific domains such as birds or airplanes. We worked with citizen scientists and domain experts to collect NABirds, a new high quality dataset containing 48,562 images of North American birds with 555 categories, part annotations and bounding boxes. We find that citizen scientists are significantly more accurate than Mechanical Turkers at zero cost. We worked with bird experts to measure the quality of popular datasets like CUB-200-2011 and ImageNet and found class label error rates of at least 4%. Nevertheless, we found that learning algorithms are surprisingly robust to annotation errors and this level of training data corruption can lead to an acceptably small increase in test error if the training set has sufficient size. At the same time, we found that an expert-curated high quality test set like NABirds is necessary to accurately measure the performance of fine-grained computer vision systems. We used NABirds to train a publicly available bird recognition service deployed on the web site of the Cornell Lab of Ornithology.
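The reported robustness to annotation errors can be probed with an experiment of the kind described here: corrupt a fraction of training labels and measure how test error changes. The sketch below shows only the label-corruption step, with made-up class names and a 4% noise rate chosen to mirror the error rate mentioned above.

```python
import random

# Minimal sketch: flip a fraction of training labels at random to simulate
# annotation errors before measuring the resulting change in test error.
def corrupt_labels(labels, classes, rate, seed=0):
    rng = random.Random(seed)
    return [rng.choice([c for c in classes if c != y]) if rng.random() < rate else y
            for y in labels]

clean = ["sparrow"] * 50 + ["warbler"] * 50
noisy = corrupt_labels(clean, ["sparrow", "warbler"], rate=0.04)
print(sum(a != b for a, b in zip(clean, noisy)), "of", len(clean), "labels flipped")
```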
Machine learning (ML) algorithms have made a tremendous impact in the field of medical imaging. While medical imaging datasets continue to grow in size, a frequently cited challenge for supervised ML algorithms is the lack of annotated data. As a result, various methods that can learn with less or other kinds of supervision have been proposed. We review semi-supervised, multiple-instance, and transfer learning in medical imaging, covering diagnosis/detection and segmentation tasks. We also discuss the connections between these learning scenarios and opportunities for future research.
Crowd-sourced assessments of machine translation quality allow evaluations to be carried out cheaply and on a large scale. It is essential, however, that the crowd's work be filtered to avoid contamination of results through the inclusion of false assessments. One method is to filter via agreement with experts, but even amongst experts agreement levels may not be high. In this paper, we present a new methodology for crowd-sourcing human assessments of translation quality, which allows individual workers to develop their own individual assessment strategy. Agreement with experts is no longer required, and a worker is deemed reliable if they are consistent relative to their own previous work. Individual translations are assessed in isolation from all others in the form of direct estimates of translation quality. This allows more meaningful statistics to be computed for systems and enables significance to be determined on smaller sets of assessments. We demonstrate the methodology's feasibility in large-scale human evaluation through replication of the human evaluation component of the Workshop on Statistical Machine Translation shared translation task for two language pairs, Spanish-to-English and English-to-Spanish. Results for measurement based solely on crowd-sourced assessments show system rankings in line with those of the original evaluation. Comparison of results produced by the relative preference approach and the direct estimate method described here demonstrates that the direct estimate method has a substantially increased ability to identify significant differences between translation systems.
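The reliability criterion described here is consistency of a worker with their own previous scores rather than agreement with experts. The sketch below shows one simple way to operationalize that idea, correlating a worker's repeated direct estimates for the same items; the scores and acceptance threshold are illustrative, not the paper's exact statistics.

```python
from statistics import correlation  # requires Python 3.10+

# Minimal sketch of a self-consistency check: a worker is kept if their repeated
# direct-estimate scores for the same translations agree strongly.
first_pass  = [78, 45, 90, 62, 30]   # worker's original 0-100 quality scores
second_pass = [75, 50, 88, 60, 35]   # same translations, re-scored later

consistency = correlation(first_pass, second_pass)
print("reliable" if consistency >= 0.8 else "filtered out", round(consistency, 3))
```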
Human linguistic annotation is crucial for many natural language processing tasks but can be expensive and time-consuming. We explore the use of Amazon's Mechanical Turk system, a significantly cheaper and faster method for collecting annotations from a broad base of paid non-expert contributors over the Web. We investigate five tasks: affect recognition, word similarity, recognizing textual entailment, event temporal ordering, and word sense disambiguation. For all five, we show high agreement between Mechanical Turk non-expert annotations and existing gold standard labels provided by expert labelers. For the task of affect recognition, we also show that using non-expert labels for training machine learning algorithms can be as effective as using gold standard annotations from experts. We propose a technique for bias correction that significantly improves annotation quality on two tasks. We conclude that many large labeling tasks can be effectively designed and carried out using this method at a fraction of the usual expense.
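The paper's bias-correction technique is not spelled out in the abstract; the sketch below shows one simple flavor of the general idea, estimating each worker's accuracy on a small gold set and weighting their votes accordingly. All worker names and data are hypothetical, and this is not claimed to be the paper's exact method.

```python
from collections import defaultdict

# Minimal sketch: estimate per-worker accuracy on a small gold set, then aggregate
# crowd labels with accuracy-weighted voting.
def worker_accuracies(gold_labels, worker_labels):
    acc = {}
    for w, labels in worker_labels.items():
        hits = sum(labels[i] == gold_labels[i] for i in gold_labels if i in labels)
        total = sum(1 for i in gold_labels if i in labels)
        acc[w] = hits / total if total else 0.5
    return acc

def weighted_vote(item_labels, accuracies):
    scores = defaultdict(float)
    for w, label in item_labels.items():
        scores[label] += accuracies.get(w, 0.5)
    return max(scores, key=scores.get)

gold = {"s1": "pos", "s2": "neg"}
worker_labels = {"w1": {"s1": "pos", "s2": "neg", "s3": "pos"},
                 "w2": {"s1": "neg", "s2": "neg", "s3": "neg"}}
acc = worker_accuracies(gold, worker_labels)
print(weighted_vote({"w1": "pos", "w2": "neg"}, acc))  # w1's vote outweighs w2's
```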
We present a medical crowdsourcing visual analytics platform called C$^2$A, used to visualize, classify, and filter crowdsourced clinical data. More specifically, C$^2$A is used to build consensus on a clinical diagnosis by visualizing crowd responses and filtering out anomalous activity. Crowdsourced medical applications have recently shown promise that non-expert users (the crowd) can achieve accuracy similar to that of medical experts. By building consensus on the findings beforehand and letting medical experts make the final diagnosis, this has the potential to reduce interpretation/reading time and possibly improve accuracy. In this paper, we focus on a virtual colonoscopy (VC) application, with clinical technicians as our target users and radiologists acting as consultants who classify segments as benign or malignant. In particular, C$^2$A is used to analyze and explore crowd responses on video segments created from fly-throughs in the virtual colon. C$^2$A provides several interactive visualization components to build crowd consensus on the video segments, to detect anomalies in the crowd data and in the VC video segments, and, finally, to improve the work quality and performance of non-expert users through A/B testing of the optimal crowdsourcing platform and application-specific parameters. Case studies and domain expert feedback demonstrate the effectiveness of our framework in improving the quality of crowd worker output, its potential to reduce radiologists' interpretation time, and its potential to improve the traditional clinical workflow by marking the majority of video segments as benign based on crowd consensus.
In recent years, tremendous progress has been made in surgical practice, for example with Minimally Invasive Surgery (MIS). To overcome challenges coming from deported eye-to-hand manipulation, robotic and computer-assisted systems have been developed. Having real-time knowledge of the pose of surgical tools with respect to the surgical camera and underlying anatomy is a key ingredient for such systems. In this paper, we present a review of the literature dealing with vision-based and marker-less surgical tool detection. This paper includes three primary contributions: (1) identification and analysis of data-sets used for developing and testing detection algorithms, (2) an in-depth comparison of surgical tool detection methods, from the feature extraction process to the model learning strategy, highlighting existing shortcomings, and (3) an analysis of validation techniques employed to obtain detection performance results and to establish comparisons between surgical tool detectors. The papers included in the review were selected through PubMed and Google Scholar searches using the keywords "surgical tool detection", "surgical tool tracking", "surgical instrument detection" and "surgical instrument tracking", limiting results to the year range 2000-2015. Our study shows that, despite significant progress over the years, the lack of established surgical tool data-sets and of a reference format for performance assessment and method ranking is preventing faster improvement.
One of the major bottlenecks in the development of data-driven AI systems is the cost of reliable human annotations. The recent advent of several crowdsourcing platforms such as Amazon's Mechanical Turk, allowing requesters access to affordable and rapid results from a global workforce, greatly facilitates the creation of massive training data. Most of the available studies on the effectiveness of crowdsourcing report on English data. We use Mechanical Turk annotations to train an Opinion Mining System to classify Spanish consumer comments. We design three different Human Intelligence Task (HIT) strategies and report high inter-annotator agreement between non-experts and expert annotators. We evaluate the advantages/drawbacks of each HIT design and show that, in our case, the use of non-expert annotations is a viable and cost-effective alternative to expert annotations.
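Inter-annotator agreement of the kind reported here is often quantified with Cohen's kappa; the sketch below computes it for a pair of annotators on illustrative sentiment labels (the abstract does not state which agreement statistic was used).

```python
from collections import Counter

# Minimal sketch of Cohen's kappa between two annotators over the same items.
def cohens_kappa(a, b):
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    counts_a, counts_b = Counter(a), Counter(b)
    expected = sum(counts_a[c] * counts_b[c] for c in set(a) | set(b)) / (n * n)
    return (observed - expected) / (1 - expected)

expert = ["pos", "neg", "neg", "pos", "neu", "pos"]
crowd  = ["pos", "neg", "pos", "pos", "neu", "pos"]
print(round(cohens_kappa(expert, crowd), 3))  # ~0.714 on this toy data
```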
Recommender systems have been evaluated in many, often incomparable, ways. In this article, we review the key decisions in evaluating collaborative filtering recommender systems: the user tasks being evaluated, the types of analysis and datasets being used, the ways in which prediction quality is measured, the evaluation of prediction attributes other than quality, and the user-based evaluation of the system as a whole. In addition to reviewing the evaluation strategies used by prior researchers, we present empirical results from the analysis of various accuracy metrics on one content domain, where all the tested metrics collapsed roughly into three equivalence classes. Metrics within each equivalence class were strongly correlated, while metrics from different equivalence classes were uncorrelated.
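Two of the prediction-quality metrics commonly compared in such evaluations are MAE and RMSE; the sketch below computes both on illustrative rating data (the specific metrics analyzed in the article are not listed in the abstract).

```python
import math

# Minimal sketch of two standard prediction-quality metrics for rating prediction.
def mae(actual, predicted):
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

actual    = [5, 3, 4, 2, 1]
predicted = [4.5, 3.5, 4.0, 2.5, 2.0]
print(round(mae(actual, predicted), 3), round(rmse(actual, predicted), 3))
```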
Medical imaging is fundamental to modern healthcare, and its widespread use has resulted in the creation of image databases, as well as picture archiving and communication systems. These repositories now contain images from a diverse range of modalities, multidimensional (three-dimensional or time-varying) images, as well as co-aligned multimodality images. These image collections offer the opportunity for evidence-based diagnosis, teaching, and research; for these applications, there is a requirement for appropriate methods to search the collections for images that have characteristics similar to the case(s) of interest. Content-based image retrieval (CBIR) is an image search technique that complements the conventional text-based retrieval of images by using visual features, such as color, texture, and shape, as search criteria. Medical CBIR is an established field of study that is beginning to realize promise when applied to multidimensional and multimodality medical data. In this paper, we present a review of state-of-the-art medical CBIR approaches in five main categories: two-dimensional image retrieval, retrieval of images with three or more dimensions, the use of nonimage data to enhance the retrieval, multimodality image retrieval, and retrieval from diverse datasets. We use these categories as a framework for discussing the state of the art, focusing on the characteristics and modalities of the information used during medical image retrieval.
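The core retrieval step in CBIR can be sketched as ranking database images by the distance between visual feature vectors; the feature values and image names below are illustrative placeholders for real color, texture, or shape descriptors.

```python
import math

# Minimal sketch of content-based retrieval: rank stored images by Euclidean
# distance between their feature vectors and the query's feature vector.
def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

database = {"scan_a": [0.1, 0.8, 0.3], "scan_b": [0.7, 0.2, 0.9], "scan_c": [0.2, 0.7, 0.4]}
query = [0.15, 0.75, 0.35]
ranked = sorted(database, key=lambda img: euclidean(database[img], query))
print(ranked)  # most visually similar images first
```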
The proliferation of misinformation in online news and its amplification by platforms are a growing concern, leading to numerous efforts to improve the detection of and response to misinformation. Given the variety of approaches, collective agreement on the indicators that signify credible content could allow for greater collaboration and data-sharing across initiatives. In this paper, we present an initial set of indicators for article credibility defined by a diverse coalition of experts. These indicators originate both from within an article's text and from external sources or article metadata. As a proof-of-concept, we present a dataset of 40 articles of varying credibility annotated with our indicators by 6 trained annotators using specialized platforms. We discuss future steps, including expanding annotation, broadening the set of indicators, and considering their use by platforms and the public, towards the development of interoperable standards for content credibility.
Online crowdsourcing provides a scalable and inexpensive means to collect knowledge (e.g., labels) about various types of data items (e.g., text, audio, video). However, it is also known to result in large variance in the quality of the recorded responses, which often cannot be used directly to train machine learning systems. To address this problem, a great deal of work has been conducted to control response quality so that low-quality responses do not adversely affect the performance of machine learning systems. Such work is referred to as quality control for crowdsourcing. Past quality control research can be divided into two major branches: quality control mechanism design and statistical models. The first branch focuses on designing measures, thresholds, interfaces, and workflows for payment, gamification, question assignment, and other mechanisms that influence worker behavior. The second branch focuses on developing statistical models to perform effective aggregation of responses in order to infer the correct response. The two branches are connected in that statistical models (i) provide parameter estimates to support the calculation of measures and thresholds and (ii) encode the modeling assumptions used to derive (theoretical) guarantees on mechanism performance. There are surveys of each branch, but they lack technical detail on the other. Our survey is the first to bridge the two branches by providing technical details of frameworks that systematically unify crowdsourcing aspects from both to determine response quality, and the first to provide a classification of quality control papers based on the proposed framework. Finally, we detail the current limitations of quality control research and the corresponding future directions.
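The statistical-model branch described above centers on aggregating responses while estimating worker reliability. The sketch below is a simplified EM-style loop that alternates between reliability-weighted voting and re-estimating each worker's reliability; it illustrates the general idea rather than any specific published model, and the response data are made up.

```python
from collections import defaultdict

# Minimal sketch: iteratively infer correct responses via reliability-weighted
# voting, then re-estimate each worker's reliability against the inferred truth.
def aggregate(responses, iterations=5):
    # responses: list of (worker, item, label)
    workers = {w for w, _, _ in responses}
    reliability = {w: 1.0 for w in workers}
    truth = {}
    for _ in range(iterations):
        scores = defaultdict(lambda: defaultdict(float))
        for w, item, label in responses:
            scores[item][label] += reliability[w]
        truth = {item: max(s, key=s.get) for item, s in scores.items()}
        for w in workers:
            answered = [(item, label) for ww, item, label in responses if ww == w]
            reliability[w] = sum(truth[i] == l for i, l in answered) / len(answered)
    return truth, reliability

responses = [("w1", "q1", "A"), ("w2", "q1", "A"), ("w3", "q1", "B"),
             ("w1", "q2", "A"), ("w2", "q2", "B"), ("w3", "q2", "B")]
print(aggregate(responses))
```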
Variations in the shape and appearance of anatomical structures in medical images are often relevant radiological signs of disease, and automatic tools can help automate parts of this otherwise manual assessment process. A cloud-based evaluation framework is presented in this paper, including results of benchmarking current state-of-the-art medical imaging algorithms for anatomical structure segmentation and landmark detection: the VISCERAL Anatomy benchmarks. The algorithms are implemented in virtual machines in the cloud, where participants can only access the training data, and can be run privately by the benchmark administrators to objectively compare their performance on an unseen common test set. Overall, 120 computed tomography and magnetic resonance patient volumes were manually annotated to create a standard Gold Corpus containing a total of 1295 structures and 1760 landmarks. Ten participants contributed automatic algorithms for the organ segmentation task, and three for the landmark localization task. Different algorithms obtained the best scores in the four available imaging modalities and for subsets of anatomical structures. The annotation framework, resulting data set, evaluation setup, results, and performance analysis from the three VISCERAL Anatomy benchmarks are presented in this article. Both the VISCERAL data set and the Silver Corpus generated with the fusion of the participant algorithms on a larger set of non-manually-annotated medical images are available to the research community.
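Segmentation benchmarks of this kind are typically scored with overlap measures between the automatic result and the gold annotation; the sketch below computes the Dice coefficient on binary voxel masks as one common example (the abstract does not enumerate the exact metrics used).

```python
# Minimal sketch of the Dice overlap coefficient between two binary segmentations,
# represented here as sets of voxel coordinates labeled as the structure.
def dice(mask_a, mask_b):
    if not mask_a and not mask_b:
        return 1.0
    return 2 * len(mask_a & mask_b) / (len(mask_a) + len(mask_b))

gold = {(1, 1, 1), (1, 1, 2), (1, 2, 2), (2, 2, 2)}
pred = {(1, 1, 1), (1, 1, 2), (2, 2, 2), (3, 3, 3)}
print(dice(gold, pred))  # 0.75
```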
The ImageNet Large Scale Visual Recognition Challenge is a benchmark for object category classification and detection over hundreds of object categories and millions of images. The challenge has been run annually since 2010, attracting participation from more than 50 institutions. This paper describes the creation of this benchmark dataset and the advances in object recognition that have been possible as a result. We discuss the challenges of collecting large-scale ground-truth annotation, highlight key breakthroughs in categorical object recognition, provide a detailed analysis of the current state of the field of large-scale image classification and object detection, and compare state-of-the-art computer vision accuracy with human accuracy. We conclude with lessons learned over the five years of the challenge and propose future directions and improvements.
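ILSVRC-style classification is commonly reported with the top-5 error, where a prediction counts as correct if the true class appears among the five highest-scoring guesses; the sketch below computes it on illustrative data.

```python
# Minimal sketch of the top-5 error used to score ILSVRC-style classification.
def top5_error(predictions, truths):
    """predictions: list of ranked class lists; truths: list of true classes."""
    misses = sum(truth not in ranked[:5] for ranked, truth in zip(predictions, truths))
    return misses / len(truths)

preds = [["cat", "dog", "fox", "lynx", "wolf", "bear"],
         ["car", "bus", "truck", "van", "tram", "bike"]]
print(top5_error(preds, ["lynx", "bike"]))  # 0.5: 'bike' is ranked sixth
```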
Due to concerns about human error in crowdsourcing, it is standard practice to collect labels for the same data point from multiple internet workers. We here show that the resulting budget can be used more effectively with a flexible worker-assignment strategy that asks fewer workers to analyze easy-to-label data and more workers to analyze data that requires extra scrutiny. Our main contribution is to show how the allocation of the number of workers can be computed optimally based on task features alone, without using worker profiles. Our target tasks are delineating cells in microscopy images and analyzing the sentiment toward the 2016 U.S. presidential candidates in tweets. We first propose an algorithm to compute a budget-optimized crowd worker allocation (BUOCA). We then train a machine learning system (BUOCA-ML) that predicts the optimal number of crowd workers needed to maximize the accuracy of the labeling. We show that the computed allocation can yield large savings in the crowdsourcing budget (up to 49 percentage points) while maintaining labeling accuracy. Finally, we envisage a human-machine system for budget-optimized data analysis at a scale beyond what is feasible with crowdsourcing alone.
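The allocation idea can be sketched as follows: under a fixed budget of worker assignments, give additional workers to tasks predicted to be hard. The greedy rule, difficulty scores, and budget in the sketch below are illustrative and are not the BUOCA algorithm itself.

```python
# Minimal sketch: allocate a fixed budget of worker assignments across tasks,
# giving extra workers to tasks with higher predicted difficulty.
def allocate_workers(difficulty, budget, min_workers=1):
    """difficulty: dict task -> predicted hardness in [0, 1]; returns task -> n_workers."""
    allocation = {t: min_workers for t in difficulty}
    remaining = budget - min_workers * len(difficulty)
    # Repeatedly give one more worker to the hardest task per worker already assigned.
    for _ in range(max(remaining, 0)):
        k = max(difficulty, key=lambda t: difficulty[t] / allocation[t])
        allocation[k] += 1
    return allocation

tasks = {"easy_image": 0.1, "medium_tweet": 0.5, "hard_image": 0.9}
print(allocate_workers(tasks, budget=9))  # hard tasks receive more of the budget
```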
In this paper we give an introduction to using Amazon's Mechanical Turk crowdsourcing platform for the purpose of collecting data for human language technologies. We survey the papers published in the NAACL-2010 Workshop. Twenty-four researchers participated in the workshop's shared task to create data for speech and language applications with $100.