30
views
0
recommends
+1 Recommend
0 collections
    8
    shares
      • Record: found
      • Abstract: found
      • Article: found
      Is Open Access

      Polylogarithmic Approximation for Edit Distance and the Asymmetric Query Complexity

      Preprint
      , ,

      Read this article at

      Bookmark
          There is no author summary for this article yet. Authors can add summaries to their articles on ScienceOpen to make them more accessible to a non-specialist audience.

          Abstract

          We present a near-linear time algorithm that approximates the edit distance between two strings within a polylogarithmic factor; specifically, for strings of length n and every fixed epsilon>0, it can compute a (log n)^O(1/epsilon) approximation in n^(1+epsilon) time. This is an exponential improvement over the previously known factor, 2^(O (sqrt(log n))), with a comparable running time (Ostrovsky and Rabani J.ACM 2007; Andoni and Onak STOC 2009). Previously, no efficient polylogarithmic approximation algorithm was known for any computational task involving edit distance (e.g., nearest neighbor search or sketching). This result arises naturally in the study of a new asymmetric query model. In this model, the input consists of two strings x and y, and an algorithm can access y in an unrestricted manner, while being charged for querying every symbol of x. Indeed, we obtain our main result by designing an algorithm that makes a small number of queries in this model. We then provide a nearly-matching lower bound on the number of queries. Our lower bound is the first to expose hardness of edit distance stemming from the input strings being "repetitive", which means that many of their substrings are approximately identical. Consequently, our lower bound provides the first rigorous separation between edit distance and Ulam distance, which is edit distance on non-repetitive strings, such as permutations.

          Related collections

          Most cited references2

          • Record: found
          • Abstract: not found
          • Article: not found

          Bounding the Expected Length of Longest Common Subsequences and Forests

            Bookmark
            • Record: found
            • Abstract: found
            • Article: not found

            Improved bounds on the average length of longest common subsequences

            It has long been known [Chvátal and Sankoff 1975] that the average length of the longest common subsequence of two random strings of length n over an alphabet of size k is asymptotic to γ k n for some constant γ k depending on k . The value of these constants remains unknown, and a number of papers have proved upper and lower bounds on them. We discuss techniques, involving numerical calculations with recurrences on many variables, for determining lower and upper bounds on these constants. To our knowledge, the previous best-known lower and upper bounds for γ 2 were those of Dančík and Paterson, approximately 0.773911 and 0.837623 [Dančík 1994; Dančík and Paterson 1995]. We improve these to 0.788071 and 0.826280. This upper bound is less than the γ 2 given by Steele's old conjecture (see Steele [1997, page 3]) that γ 2 = 2/(1 + √2)≈ 0.828427. (As Steele points out, experimental evidence had already suggested that this conjectured value was too high.) Finally, we show that the upper bound technique described here could be used to produce, for any k , a sequence of upper bounds converging to γ k , though the computation time grows very quickly as better bounds are guaranteed.
              Bookmark

              Author and article information

              Journal
              21 May 2010
              Article
              1005.4033
              cacb997c-b47d-4494-9d30-17b0f66e54b1

              http://arxiv.org/licenses/nonexclusive-distrib/1.0/

              History
              Custom metadata
              cs.DS

              Comments

              Comment on this article