Polylogarithmic Approximation for Edit Distance and the Asymmetric Query
  Complexity

There is no author summary for this article yet. Authors can add summaries to their articles on ScienceOpen to make them more accessible to a non-specialist audience.

Abstract

We present a near-linear time algorithm that approximates the edit distance between two strings within a polylogarithmic factor; specifically, for strings of length n and every fixed epsilon>0, it can compute a (log n)^O(1/epsilon) approximation in n^(1+epsilon) time. This is an exponential improvement over the previously known factor, 2^(O (sqrt(log n))), with a comparable running time (Ostrovsky and Rabani J.ACM 2007; Andoni and Onak STOC 2009). Previously, no efficient polylogarithmic approximation algorithm was known for any computational task involving edit distance (e.g., nearest neighbor search or sketching). This result arises naturally in the study of a new asymmetric query model. In this model, the input consists of two strings x and y, and an algorithm can access y in an unrestricted manner, while being charged for querying every symbol of x. Indeed, we obtain our main result by designing an algorithm that makes a small number of queries in this model. We then provide a nearly-matching lower bound on the number of queries. Our lower bound is the first to expose hardness of edit distance stemming from the input strings being "repetitive", which means that many of their substrings are approximately identical. Consequently, our lower bound provides the first rigorous separation between edit distance and Ulam distance, which is edit distance on non-repetitive strings, such as permutations.

Related collections

Most cited references 2

Record: found
Abstract: not found
Article: not found

Bounding the Expected Length of Longest Common Subsequences and Forests

R. A. Baeza-Yates, R. Gavaldà, G. Navarro, R Scheihing (1999)

0 comments Cited 5 times – based on 0 reviews      Review now

Bookmark

Record: found
Abstract: found
Article: not found

Improved bounds on the average length of longest common subsequences

George S. Lueker (2009)

It has long been known [Chvátal and Sankoff 1975] that the average length of the longest common subsequence of two random strings of length n over an alphabet of size k is asymptotic to γ k n for some constant γ k depending on k . The value of these constants remains unknown, and a number of papers have proved upper and lower bounds on them. We discuss techniques, involving numerical calculations with recurrences on many variables, for determining lower and upper bounds on these constants. To our knowledge, the previous best-known lower and upper bounds for γ 2 were those of Dančík and Paterson, approximately 0.773911 and 0.837623 [Dančík 1994; Dančík and Paterson 1995]. We improve these to 0.788071 and 0.826280. This upper bound is less than the γ 2 given by Steele's old conjecture (see Steele [1997, page 3]) that γ 2 = 2/(1 + √2)≈ 0.828427. (As Steele points out, experimental evidence had already suggested that this conjectured value was too high.) Finally, we show that the upper bound technique described here could be used to produce, for any k , a sequence of upper bounds converging to γ k , though the computation time grows very quickly as better bounds are guaranteed.