Revolt: Collaborative Crowdsourcing for Labeling Machine Learning Datasets

ABSTRACT
Crowdsourcing provides a scalable and efficient way to construct labeled datasets for training machine learning systems. However, creating comprehensive label guidelines for crowdworkers is often prohibitively difficult, even for seemingly simple concepts. Incomplete or ambiguous label guidelines can then result in differing interpretations of concepts and inconsistent labels. Existing approaches for improving label quality, such as worker screening or detection of poor work, are ineffective for this problem and can lead to rejection of honest work and a missed opportunity to capture rich interpretations about data. We introduce Revolt, a collaborative approach that brings ideas from expert annotation workflows to crowd-based labeling. Revolt eliminates the burden of creating detailed label guidelines by harnessing crowd disagreements to identify ambiguous concepts and create rich structures (groups of semantically related items) for post-hoc label decisions. Experiments comparing Revolt to traditional crowdsourced labeling show that Revolt produces high-quality labels without requiring label guidelines, in exchange for an increase in monetary cost. This up-front cost, however, is mitigated by Revolt's ability to produce reusable structures that can accommodate a variety of label boundaries without requiring new data to be collected. Further comparisons of Revolt's collaborative and non-collaborative variants show that collaboration reaches higher label accuracy with lower monetary cost.
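The core idea of deferring disagreements rather than forcing a majority vote can be sketched in a few lines. The snippet below is an illustrative simplification, not Revolt's actual pipeline: items with unanimous worker labels are accepted directly, while items with conflicting labels are set aside in groups (here grouped crudely by their vote pattern, standing in for Revolt's semantically related structures) so a dataset owner can make post-hoc boundary decisions. The `triage_labels` function and the example data are hypothetical.

```python
from collections import Counter

def triage_labels(votes):
    """Split crowd-labeled items into confident labels and ambiguous groups.

    votes: dict mapping item -> list of worker labels.
    Returns (labels, groups): unanimous items receive their label directly;
    disagreed items are deferred into groups keyed by vote pattern, a crude
    stand-in for grouping semantically related ambiguous items.
    """
    labels, groups = {}, {}
    for item, worker_labels in votes.items():
        counts = Counter(worker_labels)
        if len(counts) == 1:
            # Full agreement: accept the label without a guideline lookup.
            labels[item] = worker_labels[0]
        else:
            # Disagreement signals an ambiguous concept: defer the decision.
            pattern = tuple(sorted(counts.items()))
            groups.setdefault(pattern, []).append(item)
    return labels, groups

votes = {
    "doc1": ["sports", "sports", "sports"],
    "doc2": ["sports", "politics", "sports"],
    "doc3": ["politics", "sports", "politics"],
    "doc4": ["sports", "politics", "sports"],
}
labels, groups = triage_labels(votes)
# doc1 is labeled immediately; doc2/doc4 and doc3 land in separate
# ambiguous groups awaiting a post-hoc label boundary decision.
```

A dataset owner can then assign each deferred group a label in one decision (e.g., declaring the "mostly sports" group in-scope), which is what makes the resulting structures reusable under different label boundaries.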