skip to main content
10.1145/3035918.3064036acmconferencesArticle/Chapter ViewAbstractPublication PagesmodConference Proceedingsconference-collections
research-article

CDB: Optimizing Queries with Crowd-Based Selections and Joins

Published:09 May 2017Publication History

ABSTRACT

Crowdsourcing database systems have been proposed to leverage crowd-powered operations to encapsulate the complexities of interacting with the crowd. Existing systems suffer from two major limitations. Firstly, in order to optimize a query, they often adopt the traditional tree model to select an optimized table-level join order. However, the tree model provides a coarse-grained optimization, which generates the same order for different joined tuples and limits the optimization potential that different joined tuples can be optimized by different orders. Secondly, they mainly focus on optimizing the monetary cost. In fact, there are three optimization goals (i.e., smaller monetary cost, lower latency, and higher quality) in crowdsourcing, and it calls for a system to enable multi-goal optimization.

To address the limitations, we develop a crowd-powered database system CDB that supports crowd-based query optimizations, with focus on join and selection. CDB has fundamental differences from existing systems. First, CDB employs a graph-based query model that provides more fine-grained query optimization. Second, CDB adopts a unified framework to perform the multi-goal optimization based on the graph model. We have implemented our system and deployed it on AMT, CrowdFlower and ChinaCrowd. We have also created a benchmark for evaluating crowd-powered databases. We have conducted both simulated and real experiments, and the experimental results demonstrate the performance superiority of CDB on cost, latency and quality.

References

  1. Amazon mechanical turk. https://www.mturk.com/.Google ScholarGoogle Scholar
  2. Chinacrowd. http://www.chinacrowds.com/.Google ScholarGoogle Scholar
  3. Crowdflower. http://www.crowdflower.com.Google ScholarGoogle Scholar
  4. Dbpedia. https://www.dbpedia.org.Google ScholarGoogle Scholar
  5. Yago. https://www.mpi-inf.mpg.de.Google ScholarGoogle Scholar
  6. A. Alfarrarjeh, T. Emrich, and C. Shahabi. Scalable spatial crowdsourcing: A study of distributed algorithms. In MDM, pages 134--144, 2015. Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. Y. Amsterdamer, S. B. Davidson, T. Milo, S. Novgorodov, and A. Somech. Ontology assisted crowd mining. PVLDB, 7(13):1597--1600, 2014. Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. Y. Amsterdamer, Y. Grossman, T. Milo, and P. Senellart. Crowd mining. In SIGMOD, pages 241--252. ACM, 2013. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. A.P.Dawid and A.M.Skene. Maximum likelihood estimation of observer error-rates using em algorithm. Appl.Statist., 28(1):20--28, 1979.Google ScholarGoogle ScholarCross RefCross Ref
  10. R. J. Bayardo, Y. Ma, and R. Srikant. Scaling up all pairs similarity search. In WWW, pages 131--140, 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. R. Boim, O. Greenshpan, T. Milo, S. Novgorodov, N. Polyzotis, and W.-C. Tan. Asking the right questions in crowd data sourcing. In ICDE'12. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. D. G. Brizan and A. U. Tansel. A. survey of entity resolution and record linkage methodologies. Communications of the IIMA, 6(3):5, 2015.Google ScholarGoogle ScholarCross RefCross Ref
  13. C. Chai, G. Li, J. Li, D. Deng, and J. Feng. Cost-effective crowdsourced entity resolution: A partial-order approach. In SIGMOD, pages 969--984, 2016. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. X. Chen, P. N. Bennett, K. Collins-Thompson, and E. Horvitz. Pairwise ranking aggregation in a crowdsourced setting. In WSDM, pages 193--202, 2013. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. X. Chu, J. Morcos, I. F. Ilyas, M. Ouzzani, P. Papotti, N. Tang, and Y. Ye. Katara: A data cleaning system powered by knowledge bases and crowdsourcing. In SIGMOD, pages 1247--1261. ACM, 2015. Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. S. B. Davidson, S. Khanna, T. Milo, and S. Roy. Using the crowd for top-k and group-by queries. In ICDT, pages 225--236, 2013. Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. G. Demartini, D. E. Difallah, and P. Cudré-Mauroux. Zencrowd: leveraging probabilistic reasoning and crowdsourcing techniques for large-scale entity linking. In WWW, pages 469--478, 2012. Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the em algorithm. J.R.Statist.Soc.B, 30(1):1--38, 1977.Google ScholarGoogle Scholar
  19. A. K. Elmagarmid, P. G. Ipeirotis, and V. S. Verykios. Duplicate record detection: A survey. TKDE, 19(1):1--16, 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. J. Fan, G. Li, B. C. Ooi, K. Tan, and J. Feng. icrowd: An adaptive crowdsourcing framework. In SIGMOD, pages 1015--1030, 2015. Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. J. Fan, M. Lu, B. C. Ooi, W.-C. Tan, and M. Zhang. A hybrid machine-crowdsourcing system for matching web tables. In ICDE, pages 976--987. IEEE, 2014.Google ScholarGoogle ScholarCross RefCross Ref
  22. J. Fan, Z. Wei, D. Zhang, J. Yang, and X. Du. Distribution-aware crowdsourced entity collection. IEEE Trans. Knowl. Data Eng., 2017.Google ScholarGoogle Scholar
  23. J. Fan, M. Zhang, S. Kok, M. Lu, and B. C. Ooi. Crowdop: Query optimization for declarative crowdsourcing systems. TKDE, 27(8):2078--2092, 2015.Google ScholarGoogle ScholarDigital LibraryDigital Library
  24. M. J. Franklin, D. Kossmann, T. Kraska, S. Ramesh, and R. Xin. Crowddb: answering queries with crowdsourcing. In SIGMOD, 2011. Google ScholarGoogle ScholarDigital LibraryDigital Library
  25. C. Gokhale, S. Das, A. Doan, J. F. Naughton, N. Rampalli, J. W. Shavlik, and X. Zhu. Corleone: hands-off crowdsourcing for entity matching. In SIGMOD, pages 601--612, 2014. Google ScholarGoogle ScholarDigital LibraryDigital Library
  26. S. Guo, A. G. Parameswaran, and H. Garcia-Molina. So who won?: dynamic max discovery with the crowd. In SIGMOD, 2012. Google ScholarGoogle ScholarDigital LibraryDigital Library
  27. A. Gupta, M. Pál, R. Ravi, and A. Sinha. Boosted sampling: approximation algorithms for stochastic optimization. In ACM symposium on Theory of computing, pages 417--426, 2004. Google ScholarGoogle ScholarDigital LibraryDigital Library
  28. C.-J. Ho, S. Jabbari, and J. W. Vaughan. Adaptive task assignment for crowdsourced classification. In ICML, pages 534--542, 2013. Google ScholarGoogle ScholarDigital LibraryDigital Library
  29. C.-J. Ho and J. W. Vaughan. Online task assignment in crowdsourcing markets. In AAAI, 2012. Google ScholarGoogle ScholarDigital LibraryDigital Library
  30. H. Hu, G. Li, Z. Bao, Y. Cui, and J. Feng. Crowdsourcing-based real-time urban traffic speed estimation: From trends to speeds. In ICDE, pages 883--894, 2016.Google ScholarGoogle ScholarCross RefCross Ref
  31. H. Hu, Y. Zheng, Z. Bao, G. Li, J. Feng, and R. Cheng. Crowdsourced poi labelling: Location-aware result inference and task assignment. In ICDE, 2016.Google ScholarGoogle ScholarCross RefCross Ref
  32. P. Ipeirotis, F. Provost, and J. Wang. Quality management on amazon mechanical turk. In SIGKDD Workshop, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  33. Y. Jiang, G. Li, J. Feng, and W. Li. String similarity joins: An experimental evaluation. PVLDB, 7(8):625--636, 2014. Google ScholarGoogle ScholarDigital LibraryDigital Library
  34. M. Joglekar, H. Garcia-Molina, and A. G. Parameswaran. Evaluating the crowd with confidence. In SIGKDD, pages 686--694, 2013. Google ScholarGoogle ScholarDigital LibraryDigital Library
  35. G. Li, J. Wang, Y. Zheng, and M. J. Franklin. Crowdsourced data management: A survey. TKDE, 2015. Google ScholarGoogle ScholarDigital LibraryDigital Library
  36. Q. Li, Y. Li, J. Gao, L. Su, B. Zhao, M. Demirbas, W. Fan, and J. Han. A confidence-aware approach for truth discovery on long-tail data. PVLDB, 2014. Google ScholarGoogle ScholarDigital LibraryDigital Library
  37. Y. Li, J. Gao, C. Meng, Q. Li, L. Su, B. Zhao, W. Fan, and J. Han. A survey on truth discovery. SIGKDD Explor. Newsl., 2016. Google ScholarGoogle ScholarDigital LibraryDigital Library
  38. Q. Liu, J. Peng, and A. T. Ihler. Variational inference for crowdsourcing. In NIPS, pages 701--709, 2012. Google ScholarGoogle ScholarDigital LibraryDigital Library
  39. X. Liu, M. Lu, B. C. Ooi, Y. Shen, S. Wu, and M. Zhang. CDAS: A crowdsourcing data analytics system. PVLDB, 2012. Google ScholarGoogle ScholarDigital LibraryDigital Library
  40. F. Ma, Y. Li, Q. Li, M. Qiu, J. Gao, S. Zhi, L. Su, B. Zhao, H. Ji, and J. Han. Faitcrowd: Fine grained truth discovery for crowdsourced data aggregation. In KDD, 2015. Google ScholarGoogle ScholarDigital LibraryDigital Library
  41. A. Marcus, D. R. Karger, S. Madden, R. Miller, and S. Oh. Counting with the crowd. PVLDB, 6(2):109--120, 2012. Google ScholarGoogle ScholarDigital LibraryDigital Library
  42. A. Marcus, E. Wu, D. R. Karger, S. Madden, and R. C. Miller. Human-powered sorts and joins. PVLDB, 5(1):13--24, 2011. Google ScholarGoogle ScholarDigital LibraryDigital Library
  43. A. Marcus, E. Wu, S. Madden, and R. C. Miller. Crowdsourced databases: Query processing with people. In CIDR, 2011.Google ScholarGoogle Scholar
  44. A. G. Parameswaran, S. Boyd, H. Garcia-Molina, A. Gupta, N. Polyzotis, and J. Widom. Optimal crowd-powered rating and filtering algorithms. PVLDB, 7(9):685--696, 2014. Google ScholarGoogle ScholarDigital LibraryDigital Library
  45. A. G. Parameswaran, H. Garcia-Molina, H. Park, N. Polyzotis, A. Ramesh, and J. Widom. Crowdscreen: algorithms for filtering data with humans. In SIGMOD, pages 361--372, 2012. Google ScholarGoogle ScholarDigital LibraryDigital Library
  46. A. G. Parameswaran, H. Park, H. Garcia-Molina, N. Polyzotis, and J. Widom. Deco: declarative crowdsourcing. In CIKM, 2012. Google ScholarGoogle ScholarDigital LibraryDigital Library
  47. H. Park and J. Widom. Query optimization over crowdsourced data. PVLDB, 6(10):781--792, 2013. Google ScholarGoogle ScholarDigital LibraryDigital Library
  48. L. Pournajaf, L. Xiong, V. Sunderam, and S. Goryczka. Spatial task assignment for crowd sensing with cloaked locations. In MDM, volume 1, pages 73--82. IEEE, 2014. Google ScholarGoogle ScholarDigital LibraryDigital Library
  49. V. C. Raykar, S. Yu, L. H. Zhao, G. H. Valadez, C. Florin, L. Bogoni, and L. Moy. Learning from crowds. JMLR, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  50. S. B. Roy, I. Lykourentzou, S. Thirumuruganathan, S. Amer-Yahia, and G. Das. Task assignment optimization in knowledge-intensive crowdsourcing. The VLDB Journal, 24(4):467--491, 2015. Google ScholarGoogle ScholarDigital LibraryDigital Library
  51. A. D. Sarma, A. G. Parameswaran, H. Garcia-Molina, and A. Y. Halevy. Crowd-powered find algorithms. In ICDE, 2014.Google ScholarGoogle Scholar
  52. C. E. Shannon. A mathematical theory of communication. SIGMOBILE Mob. Comput. Commun. Rev., 5(1), Jan. 2001. Google ScholarGoogle ScholarDigital LibraryDigital Library
  53. B. Trushkowsky, T. Kraska, M. J. Franklin, and P. Sarkar. Crowdsourced enumeration queries. In ICDE, pages 673--684, 2013. Google ScholarGoogle ScholarDigital LibraryDigital Library
  54. P. Venetis, H. Garcia-Molina, K. Huang, and N. Polyzotis. Max algorithms in crowdsourcing environments. In WWW, 2012. Google ScholarGoogle ScholarDigital LibraryDigital Library
  55. N. Vesdapunt, K. Bellare, and N. N. Dalvi. Crowdsourcing algorithms for entity resolution. PVLDB, 7(12):1071--1082, 2014. Google ScholarGoogle ScholarDigital LibraryDigital Library
  56. J. Wang, T. Kraska, M. J. Franklin, and J. Feng. CrowdER: crowdsourcing entity resolution. PVLDB, 5(11), 2012. Google ScholarGoogle ScholarDigital LibraryDigital Library
  57. J. Wang, G. Li, T. Kraska, M. J. Franklin, and J. Feng. Leveraging transitive relations for crowdsourced joins. In SIGMOD, 2013. Google ScholarGoogle ScholarDigital LibraryDigital Library
  58. S. Wang, X. Xiao, and C. Lee. Crowd-based deduplication: An adaptive approach. In SIGMOD, pages 1263--1277, 2015. Google ScholarGoogle ScholarDigital LibraryDigital Library
  59. P. Welinder, S. Branson, P. Perona, and S. J. Belongie. The multidimensional wisdom of crowds. In NIPS, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  60. S. E. Whang, P. Lofgren, and H. Garcia-Molina. Question selection for crowd entity resolution. PVLDB, 6(6):349--360, 2013. Google ScholarGoogle ScholarDigital LibraryDigital Library
  61. J. Whitehill, T.-f. Wu, J. Bergsma, J. R. Movellan, and P. L. Ruvolo. Whose vote should count more: Optimal integration of labels from labelers of unknown expertise. In NIPS, 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  62. M. Yu, G. Li, D. Deng, and J. Feng. String similarity search and join: a survey. Frontiers of Computer Science, 10(3):399--417, 2016. Google ScholarGoogle ScholarDigital LibraryDigital Library
  63. C. J. Zhang, L. Chen, H. V. Jagadish, and C. C. Cao. Reducing uncertainty of schema matching via crowdsourcing. PVLDB, 2013. Google ScholarGoogle ScholarDigital LibraryDigital Library
  64. C. J. Zhang, Y. Tong, and L. Chen. Where to: Crowd-aided path selection. PVLDB, 7(14):2005--2016, 2014. Google ScholarGoogle ScholarDigital LibraryDigital Library
  65. X. Zhang, G. Li, and J. Feng. Crowdsourced top-k algorithms: An experimental evaluation. PVLDB, 9(8):612--623, 2016. Google ScholarGoogle ScholarDigital LibraryDigital Library
  66. Y. Zheng, R. Cheng, S. Maniu, and L. Mo. On optimality of jury selection in crowdsourcing. In EDBT, pages 193--204, 2015.Google ScholarGoogle Scholar
  67. Y. Zheng, G. Li, and R. Cheng. DOCS: domain-aware crowdsourcing system. PVLDB, 10(4):361--372, 2016. Google ScholarGoogle ScholarDigital LibraryDigital Library
  68. Y. Zheng, G. Li, Y. Li, C. Shan, and R. Cheng. Truth inference in crowdsourcing: Is the problem solved? PVLDB, 10(5):541--552, 2017. Google ScholarGoogle ScholarDigital LibraryDigital Library
  69. Y. Zheng, J. Wang, G. Li, R. Cheng, and J. Feng. QASCA: A quality-aware task assignment system for crowdsourcing applications. In SIGMOD, 2015. Google ScholarGoogle ScholarDigital LibraryDigital Library
  70. D. Zhou, S. Basu, Y. Mao, and J. C. Platt. Learning from the wisdom of crowds by minimax entropy. In NIPS, 2012. Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. CDB: Optimizing Queries with Crowd-Based Selections and Joins

    Recommendations

    Comments

    Login options

    Check if you have access through your login credentials or your institution to get full access on this article.

    Sign in
    • Published in

      cover image ACM Conferences
      SIGMOD '17: Proceedings of the 2017 ACM International Conference on Management of Data
      May 2017
      1810 pages
      ISBN:9781450341974
      DOI:10.1145/3035918

      Copyright © 2017 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 9 May 2017

      Permissions

      Request permissions about this article.

      Request Permissions

      Check for updates

      Qualifiers

      • research-article

      Acceptance Rates

      Overall Acceptance Rate785of4,003submissions,20%

    PDF Format

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader