ABSTRACT
Crowdsourcing database systems have been proposed to leverage crowd-powered operations to encapsulate the complexities of interacting with the crowd. Existing systems suffer from two major limitations. Firstly, in order to optimize a query, they often adopt the traditional tree model to select an optimized table-level join order. However, the tree model provides a coarse-grained optimization, which generates the same order for different joined tuples and limits the optimization potential that different joined tuples can be optimized by different orders. Secondly, they mainly focus on optimizing the monetary cost. In fact, there are three optimization goals (i.e., smaller monetary cost, lower latency, and higher quality) in crowdsourcing, and it calls for a system to enable multi-goal optimization.
To address the limitations, we develop a crowd-powered database system CDB that supports crowd-based query optimizations, with focus on join and selection. CDB has fundamental differences from existing systems. First, CDB employs a graph-based query model that provides more fine-grained query optimization. Second, CDB adopts a unified framework to perform the multi-goal optimization based on the graph model. We have implemented our system and deployed it on AMT, CrowdFlower and ChinaCrowd. We have also created a benchmark for evaluating crowd-powered databases. We have conducted both simulated and real experiments, and the experimental results demonstrate the performance superiority of CDB on cost, latency and quality.
- Amazon mechanical turk. https://www.mturk.com/.Google Scholar
- Chinacrowd. http://www.chinacrowds.com/.Google Scholar
- Crowdflower. http://www.crowdflower.com.Google Scholar
- Dbpedia. https://www.dbpedia.org.Google Scholar
- Yago. https://www.mpi-inf.mpg.de.Google Scholar
- A. Alfarrarjeh, T. Emrich, and C. Shahabi. Scalable spatial crowdsourcing: A study of distributed algorithms. In MDM, pages 134--144, 2015. Google ScholarDigital Library
- Y. Amsterdamer, S. B. Davidson, T. Milo, S. Novgorodov, and A. Somech. Ontology assisted crowd mining. PVLDB, 7(13):1597--1600, 2014. Google ScholarDigital Library
- Y. Amsterdamer, Y. Grossman, T. Milo, and P. Senellart. Crowd mining. In SIGMOD, pages 241--252. ACM, 2013. Google ScholarDigital Library
- A.P.Dawid and A.M.Skene. Maximum likelihood estimation of observer error-rates using em algorithm. Appl.Statist., 28(1):20--28, 1979.Google ScholarCross Ref
- R. J. Bayardo, Y. Ma, and R. Srikant. Scaling up all pairs similarity search. In WWW, pages 131--140, 2007. Google ScholarDigital Library
- R. Boim, O. Greenshpan, T. Milo, S. Novgorodov, N. Polyzotis, and W.-C. Tan. Asking the right questions in crowd data sourcing. In ICDE'12. Google ScholarDigital Library
- D. G. Brizan and A. U. Tansel. A. survey of entity resolution and record linkage methodologies. Communications of the IIMA, 6(3):5, 2015.Google ScholarCross Ref
- C. Chai, G. Li, J. Li, D. Deng, and J. Feng. Cost-effective crowdsourced entity resolution: A partial-order approach. In SIGMOD, pages 969--984, 2016. Google ScholarDigital Library
- X. Chen, P. N. Bennett, K. Collins-Thompson, and E. Horvitz. Pairwise ranking aggregation in a crowdsourced setting. In WSDM, pages 193--202, 2013. Google ScholarDigital Library
- X. Chu, J. Morcos, I. F. Ilyas, M. Ouzzani, P. Papotti, N. Tang, and Y. Ye. Katara: A data cleaning system powered by knowledge bases and crowdsourcing. In SIGMOD, pages 1247--1261. ACM, 2015. Google ScholarDigital Library
- S. B. Davidson, S. Khanna, T. Milo, and S. Roy. Using the crowd for top-k and group-by queries. In ICDT, pages 225--236, 2013. Google ScholarDigital Library
- G. Demartini, D. E. Difallah, and P. Cudré-Mauroux. Zencrowd: leveraging probabilistic reasoning and crowdsourcing techniques for large-scale entity linking. In WWW, pages 469--478, 2012. Google ScholarDigital Library
- A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the em algorithm. J.R.Statist.Soc.B, 30(1):1--38, 1977.Google Scholar
- A. K. Elmagarmid, P. G. Ipeirotis, and V. S. Verykios. Duplicate record detection: A survey. TKDE, 19(1):1--16, 2007. Google ScholarDigital Library
- J. Fan, G. Li, B. C. Ooi, K. Tan, and J. Feng. icrowd: An adaptive crowdsourcing framework. In SIGMOD, pages 1015--1030, 2015. Google ScholarDigital Library
- J. Fan, M. Lu, B. C. Ooi, W.-C. Tan, and M. Zhang. A hybrid machine-crowdsourcing system for matching web tables. In ICDE, pages 976--987. IEEE, 2014.Google ScholarCross Ref
- J. Fan, Z. Wei, D. Zhang, J. Yang, and X. Du. Distribution-aware crowdsourced entity collection. IEEE Trans. Knowl. Data Eng., 2017.Google Scholar
- J. Fan, M. Zhang, S. Kok, M. Lu, and B. C. Ooi. Crowdop: Query optimization for declarative crowdsourcing systems. TKDE, 27(8):2078--2092, 2015.Google ScholarDigital Library
- M. J. Franklin, D. Kossmann, T. Kraska, S. Ramesh, and R. Xin. Crowddb: answering queries with crowdsourcing. In SIGMOD, 2011. Google ScholarDigital Library
- C. Gokhale, S. Das, A. Doan, J. F. Naughton, N. Rampalli, J. W. Shavlik, and X. Zhu. Corleone: hands-off crowdsourcing for entity matching. In SIGMOD, pages 601--612, 2014. Google ScholarDigital Library
- S. Guo, A. G. Parameswaran, and H. Garcia-Molina. So who won?: dynamic max discovery with the crowd. In SIGMOD, 2012. Google ScholarDigital Library
- A. Gupta, M. Pál, R. Ravi, and A. Sinha. Boosted sampling: approximation algorithms for stochastic optimization. In ACM symposium on Theory of computing, pages 417--426, 2004. Google ScholarDigital Library
- C.-J. Ho, S. Jabbari, and J. W. Vaughan. Adaptive task assignment for crowdsourced classification. In ICML, pages 534--542, 2013. Google ScholarDigital Library
- C.-J. Ho and J. W. Vaughan. Online task assignment in crowdsourcing markets. In AAAI, 2012. Google ScholarDigital Library
- H. Hu, G. Li, Z. Bao, Y. Cui, and J. Feng. Crowdsourcing-based real-time urban traffic speed estimation: From trends to speeds. In ICDE, pages 883--894, 2016.Google ScholarCross Ref
- H. Hu, Y. Zheng, Z. Bao, G. Li, J. Feng, and R. Cheng. Crowdsourced poi labelling: Location-aware result inference and task assignment. In ICDE, 2016.Google ScholarCross Ref
- P. Ipeirotis, F. Provost, and J. Wang. Quality management on amazon mechanical turk. In SIGKDD Workshop, 2010. Google ScholarDigital Library
- Y. Jiang, G. Li, J. Feng, and W. Li. String similarity joins: An experimental evaluation. PVLDB, 7(8):625--636, 2014. Google ScholarDigital Library
- M. Joglekar, H. Garcia-Molina, and A. G. Parameswaran. Evaluating the crowd with confidence. In SIGKDD, pages 686--694, 2013. Google ScholarDigital Library
- G. Li, J. Wang, Y. Zheng, and M. J. Franklin. Crowdsourced data management: A survey. TKDE, 2015. Google ScholarDigital Library
- Q. Li, Y. Li, J. Gao, L. Su, B. Zhao, M. Demirbas, W. Fan, and J. Han. A confidence-aware approach for truth discovery on long-tail data. PVLDB, 2014. Google ScholarDigital Library
- Y. Li, J. Gao, C. Meng, Q. Li, L. Su, B. Zhao, W. Fan, and J. Han. A survey on truth discovery. SIGKDD Explor. Newsl., 2016. Google ScholarDigital Library
- Q. Liu, J. Peng, and A. T. Ihler. Variational inference for crowdsourcing. In NIPS, pages 701--709, 2012. Google ScholarDigital Library
- X. Liu, M. Lu, B. C. Ooi, Y. Shen, S. Wu, and M. Zhang. CDAS: A crowdsourcing data analytics system. PVLDB, 2012. Google ScholarDigital Library
- F. Ma, Y. Li, Q. Li, M. Qiu, J. Gao, S. Zhi, L. Su, B. Zhao, H. Ji, and J. Han. Faitcrowd: Fine grained truth discovery for crowdsourced data aggregation. In KDD, 2015. Google ScholarDigital Library
- A. Marcus, D. R. Karger, S. Madden, R. Miller, and S. Oh. Counting with the crowd. PVLDB, 6(2):109--120, 2012. Google ScholarDigital Library
- A. Marcus, E. Wu, D. R. Karger, S. Madden, and R. C. Miller. Human-powered sorts and joins. PVLDB, 5(1):13--24, 2011. Google ScholarDigital Library
- A. Marcus, E. Wu, S. Madden, and R. C. Miller. Crowdsourced databases: Query processing with people. In CIDR, 2011.Google Scholar
- A. G. Parameswaran, S. Boyd, H. Garcia-Molina, A. Gupta, N. Polyzotis, and J. Widom. Optimal crowd-powered rating and filtering algorithms. PVLDB, 7(9):685--696, 2014. Google ScholarDigital Library
- A. G. Parameswaran, H. Garcia-Molina, H. Park, N. Polyzotis, A. Ramesh, and J. Widom. Crowdscreen: algorithms for filtering data with humans. In SIGMOD, pages 361--372, 2012. Google ScholarDigital Library
- A. G. Parameswaran, H. Park, H. Garcia-Molina, N. Polyzotis, and J. Widom. Deco: declarative crowdsourcing. In CIKM, 2012. Google ScholarDigital Library
- H. Park and J. Widom. Query optimization over crowdsourced data. PVLDB, 6(10):781--792, 2013. Google ScholarDigital Library
- L. Pournajaf, L. Xiong, V. Sunderam, and S. Goryczka. Spatial task assignment for crowd sensing with cloaked locations. In MDM, volume 1, pages 73--82. IEEE, 2014. Google ScholarDigital Library
- V. C. Raykar, S. Yu, L. H. Zhao, G. H. Valadez, C. Florin, L. Bogoni, and L. Moy. Learning from crowds. JMLR, 2010. Google ScholarDigital Library
- S. B. Roy, I. Lykourentzou, S. Thirumuruganathan, S. Amer-Yahia, and G. Das. Task assignment optimization in knowledge-intensive crowdsourcing. The VLDB Journal, 24(4):467--491, 2015. Google ScholarDigital Library
- A. D. Sarma, A. G. Parameswaran, H. Garcia-Molina, and A. Y. Halevy. Crowd-powered find algorithms. In ICDE, 2014.Google Scholar
- C. E. Shannon. A mathematical theory of communication. SIGMOBILE Mob. Comput. Commun. Rev., 5(1), Jan. 2001. Google ScholarDigital Library
- B. Trushkowsky, T. Kraska, M. J. Franklin, and P. Sarkar. Crowdsourced enumeration queries. In ICDE, pages 673--684, 2013. Google ScholarDigital Library
- P. Venetis, H. Garcia-Molina, K. Huang, and N. Polyzotis. Max algorithms in crowdsourcing environments. In WWW, 2012. Google ScholarDigital Library
- N. Vesdapunt, K. Bellare, and N. N. Dalvi. Crowdsourcing algorithms for entity resolution. PVLDB, 7(12):1071--1082, 2014. Google ScholarDigital Library
- J. Wang, T. Kraska, M. J. Franklin, and J. Feng. CrowdER: crowdsourcing entity resolution. PVLDB, 5(11), 2012. Google ScholarDigital Library
- J. Wang, G. Li, T. Kraska, M. J. Franklin, and J. Feng. Leveraging transitive relations for crowdsourced joins. In SIGMOD, 2013. Google ScholarDigital Library
- S. Wang, X. Xiao, and C. Lee. Crowd-based deduplication: An adaptive approach. In SIGMOD, pages 1263--1277, 2015. Google ScholarDigital Library
- P. Welinder, S. Branson, P. Perona, and S. J. Belongie. The multidimensional wisdom of crowds. In NIPS, 2010. Google ScholarDigital Library
- S. E. Whang, P. Lofgren, and H. Garcia-Molina. Question selection for crowd entity resolution. PVLDB, 6(6):349--360, 2013. Google ScholarDigital Library
- J. Whitehill, T.-f. Wu, J. Bergsma, J. R. Movellan, and P. L. Ruvolo. Whose vote should count more: Optimal integration of labels from labelers of unknown expertise. In NIPS, 2009. Google ScholarDigital Library
- M. Yu, G. Li, D. Deng, and J. Feng. String similarity search and join: a survey. Frontiers of Computer Science, 10(3):399--417, 2016. Google ScholarDigital Library
- C. J. Zhang, L. Chen, H. V. Jagadish, and C. C. Cao. Reducing uncertainty of schema matching via crowdsourcing. PVLDB, 2013. Google ScholarDigital Library
- C. J. Zhang, Y. Tong, and L. Chen. Where to: Crowd-aided path selection. PVLDB, 7(14):2005--2016, 2014. Google ScholarDigital Library
- X. Zhang, G. Li, and J. Feng. Crowdsourced top-k algorithms: An experimental evaluation. PVLDB, 9(8):612--623, 2016. Google ScholarDigital Library
- Y. Zheng, R. Cheng, S. Maniu, and L. Mo. On optimality of jury selection in crowdsourcing. In EDBT, pages 193--204, 2015.Google Scholar
- Y. Zheng, G. Li, and R. Cheng. DOCS: domain-aware crowdsourcing system. PVLDB, 10(4):361--372, 2016. Google ScholarDigital Library
- Y. Zheng, G. Li, Y. Li, C. Shan, and R. Cheng. Truth inference in crowdsourcing: Is the problem solved? PVLDB, 10(5):541--552, 2017. Google ScholarDigital Library
- Y. Zheng, J. Wang, G. Li, R. Cheng, and J. Feng. QASCA: A quality-aware task assignment system for crowdsourcing applications. In SIGMOD, 2015. Google ScholarDigital Library
- D. Zhou, S. Basu, Y. Mao, and J. C. Platt. Learning from the wisdom of crowds by minimax entropy. In NIPS, 2012. Google ScholarDigital Library
Index Terms
- CDB: Optimizing Queries with Crowd-Based Selections and Joins
Recommendations
CrowdDB: answering queries with crowdsourcing
SIGMOD '11: Proceedings of the 2011 ACM SIGMOD International Conference on Management of dataSome queries cannot be answered by machines only. Processing such queries requires human input for providing information that is missing from the database, for performing computationally difficult functions, and for matching, ranking, or aggregating ...
CDB: a crowd-powered database system
Crowd-powered database systems can leverage the crowd's ability to address machine-hard problems, e.g., data integration. Existing crowdsourcing systems adopt the traditional tree model to select a good query plan. However, the tree model can optimize ...
CrowdOp: Query Optimization for Declarative Crowdsourcing Systems
We study the query optimization problem in declarative crowdsourcing systems. Declarative crowdsourcing is designed to hide the complexities and relieve the user of the burden of dealing with the crowd. The user is only required to submit an SQL-like ...
Comments