research-article

CDB: Optimizing Queries with Crowd-Based Selections and Joins

Authors:
Guoliang Li

Tsinghua University, Beijing, China

Tsinghua University, Beijing, China
View Profile

,
Chengliang Chai

Tsinghua University, Beijing, China

Tsinghua University, Beijing, China
View Profile

,
Ju Fan

Renmin University of China, Beijing, China

Renmin University of China, Beijing, China
View Profile

,
Xueping Weng

Tsinghua University, Beijing, China

Tsinghua University, Beijing, China
View Profile

,
Jian Li

Tsinghua University, Beijing, China

Tsinghua University, Beijing, China
View Profile

,
Yudian Zheng

The University of Hong Kong, Hong Kong, China

The University of Hong Kong, Hong Kong, China
View Profile

,
Yuanbing Li

Tsinghua University, Beijing, China

Tsinghua University, Beijing, China
View Profile

,
Xiang Yu

Tsinghua University, Beijing, China

Tsinghua University, Beijing, China
View Profile

,
Xiaohang Zhang

Tsinghua University, Beijing, China

Tsinghua University, Beijing, China
View Profile

,
Haitao Yuan

Tsinghua University, Beijing, China

Tsinghua University, Beijing, China
View Profile

SIGMOD '17: Proceedings of the 2017 ACM International Conference on Management of DataMay 2017Pages 1463–1478https://doi.org/10.1145/3035918.3064036

Published:09 May 2017Publication History

SIGMOD '17: Proceedings of the 2017 ACM International Conference on Management of Data

Pages 1463–1478

ABSTRACT

Crowdsourcing database systems have been proposed to leverage crowd-powered operations to encapsulate the complexities of interacting with the crowd. Existing systems suffer from two major limitations. Firstly, in order to optimize a query, they often adopt the traditional tree model to select an optimized table-level join order. However, the tree model provides a coarse-grained optimization, which generates the same order for different joined tuples and limits the optimization potential that different joined tuples can be optimized by different orders. Secondly, they mainly focus on optimizing the monetary cost. In fact, there are three optimization goals (i.e., smaller monetary cost, lower latency, and higher quality) in crowdsourcing, and it calls for a system to enable multi-goal optimization.

To address the limitations, we develop a crowd-powered database system CDB that supports crowd-based query optimizations, with focus on join and selection. CDB has fundamental differences from existing systems. First, CDB employs a graph-based query model that provides more fine-grained query optimization. Second, CDB adopts a unified framework to perform the multi-goal optimization based on the graph model. We have implemented our system and deployed it on AMT, CrowdFlower and ChinaCrowd. We have also created a benchmark for evaluating crowd-powered databases. We have conducted both simulated and real experiments, and the experimental results demonstrate the performance superiority of CDB on cost, latency and quality.

References

Amazon mechanical turk. https://www.mturk.com/.Google Scholar
Chinacrowd. http://www.chinacrowds.com/.Google Scholar
Crowdflower. http://www.crowdflower.com.Google Scholar
Dbpedia. https://www.dbpedia.org.Google Scholar
Yago. https://www.mpi-inf.mpg.de.Google Scholar
A. Alfarrarjeh, T. Emrich, and C. Shahabi. Scalable spatial crowdsourcing: A study of distributed algorithms. In MDM, pages 134--144, 2015. Google ScholarDigital Library
Y. Amsterdamer, S. B. Davidson, T. Milo, S. Novgorodov, and A. Somech. Ontology assisted crowd mining. PVLDB, 7(13):1597--1600, 2014. Google ScholarDigital Library
Y. Amsterdamer, Y. Grossman, T. Milo, and P. Senellart. Crowd mining. In SIGMOD, pages 241--252. ACM, 2013. Google ScholarDigital Library
A.P.Dawid and A.M.Skene. Maximum likelihood estimation of observer error-rates using em algorithm. Appl.Statist., 28(1):20--28, 1979.Google ScholarCross Ref
R. J. Bayardo, Y. Ma, and R. Srikant. Scaling up all pairs similarity search. In WWW, pages 131--140, 2007. Google ScholarDigital Library
R. Boim, O. Greenshpan, T. Milo, S. Novgorodov, N. Polyzotis, and W.-C. Tan. Asking the right questions in crowd data sourcing. In ICDE'12. Google ScholarDigital Library
D. G. Brizan and A. U. Tansel. A. survey of entity resolution and record linkage methodologies. Communications of the IIMA, 6(3):5, 2015.Google ScholarCross Ref
C. Chai, G. Li, J. Li, D. Deng, and J. Feng. Cost-effective crowdsourced entity resolution: A partial-order approach. In SIGMOD, pages 969--984, 2016. Google ScholarDigital Library
X. Chen, P. N. Bennett, K. Collins-Thompson, and E. Horvitz. Pairwise ranking aggregation in a crowdsourced setting. In WSDM, pages 193--202, 2013. Google ScholarDigital Library
X. Chu, J. Morcos, I. F. Ilyas, M. Ouzzani, P. Papotti, N. Tang, and Y. Ye. Katara: A data cleaning system powered by knowledge bases and crowdsourcing. In SIGMOD, pages 1247--1261. ACM, 2015. Google ScholarDigital Library
S. B. Davidson, S. Khanna, T. Milo, and S. Roy. Using the crowd for top-k and group-by queries. In ICDT, pages 225--236, 2013. Google ScholarDigital Library
G. Demartini, D. E. Difallah, and P. Cudré-Mauroux. Zencrowd: leveraging probabilistic reasoning and crowdsourcing techniques for large-scale entity linking. In WWW, pages 469--478, 2012. Google ScholarDigital Library
A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the em algorithm. J.R.Statist.Soc.B, 30(1):1--38, 1977.Google Scholar
A. K. Elmagarmid, P. G. Ipeirotis, and V. S. Verykios. Duplicate record detection: A survey. TKDE, 19(1):1--16, 2007. Google ScholarDigital Library
J. Fan, G. Li, B. C. Ooi, K. Tan, and J. Feng. icrowd: An adaptive crowdsourcing framework. In SIGMOD, pages 1015--1030, 2015. Google ScholarDigital Library
J. Fan, M. Lu, B. C. Ooi, W.-C. Tan, and M. Zhang. A hybrid machine-crowdsourcing system for matching web tables. In ICDE, pages 976--987. IEEE, 2014.Google ScholarCross Ref
J. Fan, Z. Wei, D. Zhang, J. Yang, and X. Du. Distribution-aware crowdsourced entity collection. IEEE Trans. Knowl. Data Eng., 2017.Google Scholar
J. Fan, M. Zhang, S. Kok, M. Lu, and B. C. Ooi. Crowdop: Query optimization for declarative crowdsourcing systems. TKDE, 27(8):2078--2092, 2015.Google ScholarDigital Library
M. J. Franklin, D. Kossmann, T. Kraska, S. Ramesh, and R. Xin. Crowddb: answering queries with crowdsourcing. In SIGMOD, 2011. Google ScholarDigital Library
C. Gokhale, S. Das, A. Doan, J. F. Naughton, N. Rampalli, J. W. Shavlik, and X. Zhu. Corleone: hands-off crowdsourcing for entity matching. In SIGMOD, pages 601--612, 2014. Google ScholarDigital Library
S. Guo, A. G. Parameswaran, and H. Garcia-Molina. So who won?: dynamic max discovery with the crowd. In SIGMOD, 2012. Google ScholarDigital Library
A. Gupta, M. Pál, R. Ravi, and A. Sinha. Boosted sampling: approximation algorithms for stochastic optimization. In ACM symposium on Theory of computing, pages 417--426, 2004. Google ScholarDigital Library
C.-J. Ho, S. Jabbari, and J. W. Vaughan. Adaptive task assignment for crowdsourced classification. In ICML, pages 534--542, 2013. Google ScholarDigital Library
C.-J. Ho and J. W. Vaughan. Online task assignment in crowdsourcing markets. In AAAI, 2012. Google ScholarDigital Library
H. Hu, G. Li, Z. Bao, Y. Cui, and J. Feng. Crowdsourcing-based real-time urban traffic speed estimation: From trends to speeds. In ICDE, pages 883--894, 2016.Google ScholarCross Ref
H. Hu, Y. Zheng, Z. Bao, G. Li, J. Feng, and R. Cheng. Crowdsourced poi labelling: Location-aware result inference and task assignment. In ICDE, 2016.Google ScholarCross Ref
P. Ipeirotis, F. Provost, and J. Wang. Quality management on amazon mechanical turk. In SIGKDD Workshop, 2010. Google ScholarDigital Library
Y. Jiang, G. Li, J. Feng, and W. Li. String similarity joins: An experimental evaluation. PVLDB, 7(8):625--636, 2014. Google ScholarDigital Library
M. Joglekar, H. Garcia-Molina, and A. G. Parameswaran. Evaluating the crowd with confidence. In SIGKDD, pages 686--694, 2013. Google ScholarDigital Library
G. Li, J. Wang, Y. Zheng, and M. J. Franklin. Crowdsourced data management: A survey. TKDE, 2015. Google ScholarDigital Library
Q. Li, Y. Li, J. Gao, L. Su, B. Zhao, M. Demirbas, W. Fan, and J. Han. A confidence-aware approach for truth discovery on long-tail data. PVLDB, 2014. Google ScholarDigital Library
Y. Li, J. Gao, C. Meng, Q. Li, L. Su, B. Zhao, W. Fan, and J. Han. A survey on truth discovery. SIGKDD Explor. Newsl., 2016. Google ScholarDigital Library
Q. Liu, J. Peng, and A. T. Ihler. Variational inference for crowdsourcing. In NIPS, pages 701--709, 2012. Google ScholarDigital Library
X. Liu, M. Lu, B. C. Ooi, Y. Shen, S. Wu, and M. Zhang. CDAS: A crowdsourcing data analytics system. PVLDB, 2012. Google ScholarDigital Library
F. Ma, Y. Li, Q. Li, M. Qiu, J. Gao, S. Zhi, L. Su, B. Zhao, H. Ji, and J. Han. Faitcrowd: Fine grained truth discovery for crowdsourced data aggregation. In KDD, 2015. Google ScholarDigital Library
A. Marcus, D. R. Karger, S. Madden, R. Miller, and S. Oh. Counting with the crowd. PVLDB, 6(2):109--120, 2012. Google ScholarDigital Library
A. Marcus, E. Wu, D. R. Karger, S. Madden, and R. C. Miller. Human-powered sorts and joins. PVLDB, 5(1):13--24, 2011. Google ScholarDigital Library
A. Marcus, E. Wu, S. Madden, and R. C. Miller. Crowdsourced databases: Query processing with people. In CIDR, 2011.Google Scholar
A. G. Parameswaran, S. Boyd, H. Garcia-Molina, A. Gupta, N. Polyzotis, and J. Widom. Optimal crowd-powered rating and filtering algorithms. PVLDB, 7(9):685--696, 2014. Google ScholarDigital Library
A. G. Parameswaran, H. Garcia-Molina, H. Park, N. Polyzotis, A. Ramesh, and J. Widom. Crowdscreen: algorithms for filtering data with humans. In SIGMOD, pages 361--372, 2012. Google ScholarDigital Library
A. G. Parameswaran, H. Park, H. Garcia-Molina, N. Polyzotis, and J. Widom. Deco: declarative crowdsourcing. In CIKM, 2012. Google ScholarDigital Library
H. Park and J. Widom. Query optimization over crowdsourced data. PVLDB, 6(10):781--792, 2013. Google ScholarDigital Library
L. Pournajaf, L. Xiong, V. Sunderam, and S. Goryczka. Spatial task assignment for crowd sensing with cloaked locations. In MDM, volume 1, pages 73--82. IEEE, 2014. Google ScholarDigital Library
V. C. Raykar, S. Yu, L. H. Zhao, G. H. Valadez, C. Florin, L. Bogoni, and L. Moy. Learning from crowds. JMLR, 2010. Google ScholarDigital Library
S. B. Roy, I. Lykourentzou, S. Thirumuruganathan, S. Amer-Yahia, and G. Das. Task assignment optimization in knowledge-intensive crowdsourcing. The VLDB Journal, 24(4):467--491, 2015. Google ScholarDigital Library
A. D. Sarma, A. G. Parameswaran, H. Garcia-Molina, and A. Y. Halevy. Crowd-powered find algorithms. In ICDE, 2014.Google Scholar
C. E. Shannon. A mathematical theory of communication. SIGMOBILE Mob. Comput. Commun. Rev., 5(1), Jan. 2001. Google ScholarDigital Library
B. Trushkowsky, T. Kraska, M. J. Franklin, and P. Sarkar. Crowdsourced enumeration queries. In ICDE, pages 673--684, 2013. Google ScholarDigital Library
P. Venetis, H. Garcia-Molina, K. Huang, and N. Polyzotis. Max algorithms in crowdsourcing environments. In WWW, 2012. Google ScholarDigital Library
N. Vesdapunt, K. Bellare, and N. N. Dalvi. Crowdsourcing algorithms for entity resolution. PVLDB, 7(12):1071--1082, 2014. Google ScholarDigital Library
J. Wang, T. Kraska, M. J. Franklin, and J. Feng. CrowdER: crowdsourcing entity resolution. PVLDB, 5(11), 2012. Google ScholarDigital Library
J. Wang, G. Li, T. Kraska, M. J. Franklin, and J. Feng. Leveraging transitive relations for crowdsourced joins. In SIGMOD, 2013. Google ScholarDigital Library
S. Wang, X. Xiao, and C. Lee. Crowd-based deduplication: An adaptive approach. In SIGMOD, pages 1263--1277, 2015. Google ScholarDigital Library
P. Welinder, S. Branson, P. Perona, and S. J. Belongie. The multidimensional wisdom of crowds. In NIPS, 2010. Google ScholarDigital Library
S. E. Whang, P. Lofgren, and H. Garcia-Molina. Question selection for crowd entity resolution. PVLDB, 6(6):349--360, 2013. Google ScholarDigital Library
J. Whitehill, T.-f. Wu, J. Bergsma, J. R. Movellan, and P. L. Ruvolo. Whose vote should count more: Optimal integration of labels from labelers of unknown expertise. In NIPS, 2009. Google ScholarDigital Library
M. Yu, G. Li, D. Deng, and J. Feng. String similarity search and join: a survey. Frontiers of Computer Science, 10(3):399--417, 2016. Google ScholarDigital Library
C. J. Zhang, L. Chen, H. V. Jagadish, and C. C. Cao. Reducing uncertainty of schema matching via crowdsourcing. PVLDB, 2013. Google ScholarDigital Library
C. J. Zhang, Y. Tong, and L. Chen. Where to: Crowd-aided path selection. PVLDB, 7(14):2005--2016, 2014. Google ScholarDigital Library
X. Zhang, G. Li, and J. Feng. Crowdsourced top-k algorithms: An experimental evaluation. PVLDB, 9(8):612--623, 2016. Google ScholarDigital Library
Y. Zheng, R. Cheng, S. Maniu, and L. Mo. On optimality of jury selection in crowdsourcing. In EDBT, pages 193--204, 2015.Google Scholar
Y. Zheng, G. Li, and R. Cheng. DOCS: domain-aware crowdsourcing system. PVLDB, 10(4):361--372, 2016. Google ScholarDigital Library
Y. Zheng, G. Li, Y. Li, C. Shan, and R. Cheng. Truth inference in crowdsourcing: Is the problem solved? PVLDB, 10(5):541--552, 2017. Google ScholarDigital Library
Y. Zheng, J. Wang, G. Li, R. Cheng, and J. Feng. QASCA: A quality-aware task assignment system for crowdsourcing applications. In SIGMOD, 2015. Google ScholarDigital Library
D. Zhou, S. Basu, Y. Mao, and J. C. Platt. Learning from the wisdom of crowds by minimax entropy. In NIPS, 2012. Google ScholarDigital Library

Index Terms

CDB: Optimizing Queries with Crowd-Based Selections and Joins
1. Information systems
  1. World Wide Web
    1. Web applications
      1. Crowdsourcing

Recommendations

CrowdDB: answering queries with crowdsourcing
SIGMOD '11: Proceedings of the 2011 ACM SIGMOD International Conference on Management of data

Some queries cannot be answered by machines only. Processing such queries requires human input for providing information that is missing from the database, for performing computationally difficult functions, and for matching, ranking, or aggregating ...
Read More
CDB: a crowd-powered database system

Crowd-powered database systems can leverage the crowd's ability to address machine-hard problems, e.g., data integration. Existing crowdsourcing systems adopt the traditional tree model to select a good query plan. However, the tree model can optimize ...
Read More
CrowdOp: Query Optimization for Declarative Crowdsourcing Systems
We study the query optimization problem in declarative crowdsourcing systems. Declarative crowdsourcing is designed to hide the complexities and relieve the user of the burden of dealing with the crowd. The user is only required to submit an SQL-like ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Published in
SIGMOD '17: Proceedings of the 2017 ACM International Conference on Management of Data
May 2017
1810 pages
ISBN:9781450341974
DOI:10.1145/3035918
General Chairs:
Rada Chirkova
North Carolina State University, USA
,
Jun Yang
Duke University, USA
,
Program Chair:
Dan Suciu
University of Washington, USA
Copyright © 2017 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 9 May 2017
Permissions
Request permissions about this article.
Request Permissions

Check for updates
Author Tags
crowd-based join
crowd-based selection
crowdsourcing
crowdsourcing optimization
Qualifiers
- research-article
Conference

Acceptance Rates
Overall Acceptance Rate785of4,003submissions,20%
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 57
  Total Citations
  View Citations
- 658
  Total Downloads
- Downloads (Last 12 months)91
- Downloads (Last 6 weeks)9
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

CDB: Optimizing Queries with Crowd-Based Selections and Joins

SIGMOD '17: Proceedings of the 2017 ACM International Conference on Management of Data

ABSTRACT

References

Cited By

Index Terms

Recommendations

CrowdDB: answering queries with crowdsourcing

CDB: a crowd-powered database system

CrowdOp: Query Optimization for Declarative Crowdsourcing Systems