10.5555/1065226.1065247
Article

Near-duplicate detection for eRulemaking

Published: 15 May 2005

Abstract

U.S. regulatory agencies are required to solicit, consider, and respond to public comments before issuing regulations. In recent years, agencies have begun to accept comments via both email and Web forms. The transition from paper to electronic comments makes it much easier for individuals to customize "form" letters, which they do, creating "near-duplicate" comments that express the same viewpoint in slightly different language. This paper explores the use of simple text clustering and retrieval algorithms for identifying near-duplicate public comments. Experiments with public comments about a recent regulation proposed by the Environmental Protection Agency (EPA) demonstrate the effectiveness of the algorithms.



    Published In

    dg.o '05: Proceedings of the 2005 national conference on Digital government research
    May 2005
    328 pages

    Sponsors

    • NSF: National Science Foundation

    Publisher

    Digital Government Society of North America


    Author Tags

    1. eRulemaking
    2. information retrieval
    3. near duplicate detection
    4. public comments

    Qualifiers

    • Article

    Conference

    dg.o '05
    Sponsor:
    • NSF
    dg.o '05: Digital government research
    May 15 - 18, 2005
    Atlanta, Georgia, USA

    Acceptance Rates

    Overall Acceptance Rate 150 of 271 submissions, 55%


    Cited By

    • (2018) E-Rulemaking. International Journal of Technology and Human Interaction 14:2 (35-53). 10.4018/IJTHI.2018040103. Online publication date: 1-Apr-2018
    • (2015) Introducing textual analysis tools for policy informatics. Proceedings of the 16th Annual International Conference on Digital Government Research (10-19). 10.1145/2757401.2757421. Online publication date: 27-May-2015
    • (2013) Detecting near-duplicate documents using sentence-level features and supervised learning. Expert Systems with Applications: An International Journal 40:5 (1467-1476). 10.1016/j.eswa.2012.08.045. Online publication date: 1-Apr-2013
    • (2011) Reuse in the wild. Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (2877-2886). 10.1145/1978942.1979370. Online publication date: 7-May-2011
    • (2009) Disambiguating authors in academic publications using random forests. Proceedings of the 9th ACM/IEEE-CS joint conference on Digital libraries (39-48). 10.1145/1555400.1555408. Online publication date: 15-Jun-2009
    • (2008) A study in rule-specific issue categorization for e-rulemaking. Proceedings of the 2008 international conference on Digital government research (244-253). 10.5555/1367832.1367874. Online publication date: 18-May-2008
    • (2008) Active learning for e-rulemaking. Proceedings of the 2008 international conference on Digital government research (234-243). 10.5555/1367832.1367873. Online publication date: 18-May-2008
    • (2007) A bootstrapping approach for identifying stakeholders in public-comment corpora. Proceedings of the 8th annual international conference on Digital government research: bridging disciplines & domains (92-101). 10.5555/1248460.1248475. Online publication date: 20-May-2007
    • (2007) Identifying and classifying subjective claims. Proceedings of the 8th annual international conference on Digital government research: bridging disciplines & domains (76-81). 10.5555/1248460.1248473. Online publication date: 20-May-2007
    • (2006) Get out the vote. Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing (327-335). 10.5555/1610075.1610122. Online publication date: 22-Jul-2006
