skip to main content
article

Low distortion embeddings for edit distance

Published: 01 October 2007 Publication History

Abstract

We show that {0, 1}d endowed with edit distance embeds into ℓ1 with distortion 2O(√log d log log d). We further show efficient implementation of the embedding that yield solutions to various computational problems involving edit distance. These include sketching, communication complexity, nearest neighbor search. For all these problems, we improve upon previous bounds.

References

[1]
Andoni, A., Deza, M., Gupta, A., Indyk, P., and Raskhodnikova, S. 2003. Lower bounds for embedding edit distance into normed spaces. In Proceedings of the ACM-SIAM Symposium on Discrete Algorithms, ACM, New York, 523--526.
[2]
Bar-Yossef, Z., Jayram, T. S., Krauthgamer, R., and Kumar, R. 2004. Approximating edit distance efficiently. In Proceedings of the Symposium on Foundations of Computer Science. IEEE Computer Society Press, Los Alamitos, CA.
[3]
Batu, T., Ergun, F., Kilian, J., Magen, A., Raskhodnikova, S., Rubinfeld, R. and Sami, R. 2003. A sublinear algorithm for weakly approximating edit distance. In Proceedings of the Symposium on Theory of Computing. ACM, New York.
[4]
Broder, A., Glassman, S., Manasse, M., and Zweig, G. 1997. Syntactic clustering of the Web. In Proc. of the 6th International World Wide Web Conference, 391--404.
[5]
Cormode, G., and Muthukrishnan, S. 2002. The string edit distance matching problem with moves. In Proceedings of the Symposium on Foundations of Computer Science. IEEE Computer Society Press, Los Alamitos, CA. 667--676.
[6]
Cormode, G., Paterson, M., Sahinalp, C. S., and Vishkin, U. 2000. Communication complexity of document exchange. In Proceedings of the Symposium on Foundations of Computer Science. IEEE Computer Society Press, Los Alamitos, CA.
[7]
Indyk, P. 2004. Approximate nearest neighbor under edit distance via product metrics. In Proceedings of the Symposium on Foundations of Computer Science. IEEE Computer Society Press, Los Alamitos, CA. 646--650.
[8]
Indyk, P., and Motwani, R. 1998. Approximate nearest neighbors: Towards removing the curse of dimensionality. In Proceedings of the Symposium on Theory of Computing. ACM, New York. 604--613.
[9]
Khot, S., and Naor, A. 2005. Nonembeddability theorems via Fourier analysis. In Proceedings of the Symposium on Foundations of Computer Science. IEEE Computer Society Press, Los Alamitos, CA.
[10]
Krauthgamer, R., and Rabani, Y. 2006. Improved lower bounds for embeddings into L1. In Proceedings of the ACM-SIAM Symposium on Discrete Algorithms, ACM, New York.
[11]
Kushilevitz, E., Ostrovsky, R., and Rabani, Y. 2000. Efficient search for approximate nearest neighbor in high dimensional spaces. SIAM J. Comput. 30, 2, 457--474. (Preliminary version appeared in Proceedings of the Symposium on Theory of Computing. ACM, New York.)
[12]
Levenshtein, V. I. 1965. Binary codes capable of correcting deletions, insertions, and reversals. Doklady Akademii Nauk SSSR, 163, 4, 845--848 (Russian). (English translation in Soviet Physics Doklady 10, 8, 707--710, 1996.)
[13]
Masek, W. J., and Paterson, M. S. 1980. A faster algorithm for computing string edit distnace. J. Comput. Syst. Sci. 20, 1, 18--31.
[14]
Muthukrishnan, S., and Sahinalp, S. C. 2000. Approximate nearest neighbors and sequence comparison with block operations. In Proceedings of the Symposium on Theory of Computing. ACM, New York., 416--424.

Cited By

View all
  • (2024)Levenshtein distance embedding with poisson regression for DNA storageProceedings of the Thirty-Eighth AAAI Conference on Artificial Intelligence and Thirty-Sixth Conference on Innovative Applications of Artificial Intelligence and Fourteenth Symposium on Educational Advances in Artificial Intelligence10.1609/aaai.v38i14.29509(15796-15804)Online publication date: 20-Feb-2024
  • (2024)Neural embeddings for kNN search in biological sequenceProceedings of the Thirty-Eighth AAAI Conference on Artificial Intelligence and Thirty-Sixth Conference on Innovative Applications of Artificial Intelligence and Fourteenth Symposium on Educational Advances in Artificial Intelligence10.1609/aaai.v38i1.27753(38-45)Online publication date: 20-Feb-2024
  • (2024)Almost Linear Size Edit Distance SketchProceedings of the 56th Annual ACM Symposium on Theory of Computing10.1145/3618260.3649783(956-967)Online publication date: 10-Jun-2024
  • Show More Cited By

Recommendations

Reviews

Constantin S Chassapis

Use what you understand well where you know little, by transforming that about which you know little into something embeddable in the well-understood corpus. Do the transformation with the lowest distortion possible. This is the heart of this important paper, applied to string and document comparison. The paper uses the language of discrete mathematics. It is not easy to read, but is certainly worth reading. L1 is a distance (metric) between two entities equal to the sum of the absolute values of the differences between corresponding coordinates of the entities. Many good algorithms exist for it. The edit or Levenstein distance measures the minimum number of character insertions, deletions, and substitutions needed to transform one string into the other. The edit distance is well defined even for strings of varying lengths. Improving the algorithms that use this distance is an important endeavor for both theoretical and practical reasons. To go from one approach to the other, one needs to transform the strings to appropriate entities that can admit numerical coordinates. This is an embedding. The issue is what and how to embed—in order to preserve, with the lowest distortion possible—the edit distance-based metric in the process. The authors capitalized on earlier observations that if edit distance is allowed to handle block operations (that is, large substrings in a single operation), then the embeddings have low distortion. They devise a procedure that performs a novel, efficient type of embedding that further lowers the distortion previously known in the literature: the two initial strings are split into well-defined sets of substrings and these collections of substrings, appropriately compared to preserve the edit distance character, generate numbers (coordinates); the authors prove that these can be used in an L1 approach to compare the two initial strings. Their methodology uses in part a probabilistic approach in order to make the procedure efficient. The paper would be useful to people studying this subject theoretically, and also to algorithm creators and industry specialists. Online Computing Reviews Service

Access critical reviews of Computing literature here

Become a reviewer for Computing Reviews.

Comments

Information & Contributors

Information

Published In

cover image Journal of the ACM
Journal of the ACM  Volume 54, Issue 5
October 2007
162 pages
ISSN:0004-5411
EISSN:1557-735X
DOI:10.1145/1284320
Issue’s Table of Contents

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 01 October 2007
Published in JACM Volume 54, Issue 5

Permissions

Request permissions for this article.

Check for updates

Author Tags

  1. Levenshtein distance
  2. Pattern matching
  3. communication complexity
  4. computations on discrete structures
  5. dimension reduction
  6. edit distance
  7. metric embeddings
  8. nearest neighbor search
  9. sketching

Qualifiers

  • Article

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)20
  • Downloads (Last 6 weeks)3
Reflects downloads up to 13 Feb 2025

Other Metrics

Citations

Cited By

View all
  • (2024)Levenshtein distance embedding with poisson regression for DNA storageProceedings of the Thirty-Eighth AAAI Conference on Artificial Intelligence and Thirty-Sixth Conference on Innovative Applications of Artificial Intelligence and Fourteenth Symposium on Educational Advances in Artificial Intelligence10.1609/aaai.v38i14.29509(15796-15804)Online publication date: 20-Feb-2024
  • (2024)Neural embeddings for kNN search in biological sequenceProceedings of the Thirty-Eighth AAAI Conference on Artificial Intelligence and Thirty-Sixth Conference on Innovative Applications of Artificial Intelligence and Fourteenth Symposium on Educational Advances in Artificial Intelligence10.1609/aaai.v38i1.27753(38-45)Online publication date: 20-Feb-2024
  • (2024)Almost Linear Size Edit Distance SketchProceedings of the 56th Annual ACM Symposium on Theory of Computing10.1145/3618260.3649783(956-967)Online publication date: 10-Jun-2024
  • (2024)Learning locality-sensitive bucketing functionsBioinformatics10.1093/bioinformatics/btae22840:Supplement_1(i318-i327)Online publication date: 28-Jun-2024
  • (2023)Locality-sensitive bucketing functions for the edit distanceAlgorithms for Molecular Biology10.1186/s13015-023-00234-218:1Online publication date: 24-Jul-2023
  • (2023)On the Maximal Independent Sets of k-mers with the Edit DistanceProceedings of the 14th ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics10.1145/3584371.3612982(1-6)Online publication date: 3-Sep-2023
  • (2023)Locally Consistent Decomposition of Strings with Applications to Edit Distance SketchingProceedings of the 55th Annual ACM Symposium on Theory of Computing10.1145/3564246.3585239(219-232)Online publication date: 2-Jun-2023
  • (2023)EdtClust: A fast homologous protein sequences clustering method based on edit distance2023 IEEE International Conference on Bioinformatics and Biomedicine (BIBM)10.1109/BIBM58861.2023.10385950(690-697)Online publication date: 5-Dec-2023
  • (2022)Almost-optimal sublinear-time edit distance in the low distance regimeProceedings of the 54th Annual ACM SIGACT Symposium on Theory of Computing10.1145/3519935.3519990(1102-1115)Online publication date: 9-Jun-2022
  • (2021)Neural distance embeddings for biological sequencesProceedings of the 35th International Conference on Neural Information Processing Systems10.5555/3540261.3541678(18539-18551)Online publication date: 6-Dec-2021
  • Show More Cited By

View Options

Login options

Full Access

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Figures

Tables

Media

Share

Share

Share this Publication link

Share on social media