
      Scalable K-Means++

      Preprint


          Abstract

          Over half a century old and showing no signs of aging, k-means remains one of the most popular data processing algorithms. As is well known, a proper initialization of k-means is crucial for obtaining a good final solution. The recently proposed k-means++ initialization algorithm achieves this, obtaining an initial set of centers that is provably close to the optimum solution. A major downside of k-means++ is its inherently sequential nature, which limits its applicability to massive data: one must make k passes over the data to find a good initial set of centers. In this work we show how to drastically reduce the number of passes needed to obtain, in parallel, a good initialization. This is unlike prevailing efforts on parallelizing k-means, which have mostly focused on the post-initialization phases of k-means. We prove that our proposed initialization algorithm, k-means||, obtains a nearly optimal solution after a logarithmic number of passes, and then show that in practice a constant number of passes suffices. Experimental evaluation on real-world large-scale data demonstrates that k-means|| outperforms k-means++ in both sequential and parallel settings.
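
The contrast the abstract draws between k sequential passes and a small number of parallelizable passes can be illustrated with a minimal sketch. The abstract does not spell out the steps, so the code below follows the oversampling scheme as k-means|| is commonly described: sample many candidates per pass, weight them by how many points they attract, then reduce the candidates to k centers. The function names (`kmeanspp_seed`, `kmeans_parallel_seed`) and the parameters `oversample` and `rounds` are illustrative assumptions, not the authors' code, and everything runs in a single process for clarity.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def nearest_sq_dists(points, centers):
    """One pass over the data: squared distance from each point to its nearest center."""
    return [min(sq_dist(p, c) for c in centers) for p in points]


def kmeanspp_seed(points, k, rng, weights=None):
    """k-means++ style seeding: one new center per pass, so about k passes in total."""
    weights = weights or [1.0] * len(points)
    centers = [rng.choices(points, weights=weights)[0]]
    while len(centers) < k:
        d2 = nearest_sq_dists(points, centers)
        probs = [w * d for w, d in zip(weights, d2)]
        if sum(probs) == 0:  # every remaining point coincides with a center
            break
        # Sample the next center with probability proportional to weight * D^2.
        centers.append(rng.choices(points, weights=probs)[0])
    return centers


def kmeans_parallel_seed(points, k, oversample, rounds, rng):
    """Oversampling seeding (assumed form of k-means||): few passes, many samples each."""
    centers = [rng.choice(points)]
    for _ in range(rounds):
        d2 = nearest_sq_dists(points, centers)  # this pass could be split across workers
        cost = sum(d2)
        if cost == 0:
            break
        # Each point is kept independently of the others, which is what
        # makes the sampling step easy to parallelize.
        centers += [p for p, d in zip(points, d2)
                    if rng.random() < min(1.0, oversample * d / cost)]
    # Weight each candidate by the number of points closest to it.
    weights = [0.0] * len(centers)
    for p in points:
        j = min(range(len(centers)), key=lambda i: sq_dist(p, centers[i]))
        weights[j] += 1.0
    # The candidate set is small, so reducing it to k centers sequentially is cheap.
    return kmeanspp_seed(centers, k, rng, weights)


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical toy data: three well-separated Gaussian blobs in the plane.
    data = [(rng.gauss(cx, 0.5), rng.gauss(cy, 0.5))
            for cx, cy in [(0, 0), (10, 10), (-10, 10)] for _ in range(100)]
    print(kmeans_parallel_seed(data, k=3, oversample=6, rounds=5, rng=rng))
```

In this sketch `kmeanspp_seed` touches the full data set once per center, while `kmeans_parallel_seed` touches it once per round and then works only on the much smaller candidate set. The values `oversample=6` and `rounds=5` are arbitrary choices for the toy data, not recommendations from the paper.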



                Author and article information

                Journal
                28 March 2012
                Article
                1203.6402

                http://arxiv.org/licenses/nonexclusive-distrib/1.0/

                Custom metadata
                Proceedings of the VLDB Endowment (PVLDB), Vol. 5, No. 7, pp. 622-633 (2012)
                VLDB2012
                cs.DB
                Ahmet Sacan
