Scalable K-Means++

Preprint

Author(s): Bahman Bahmani, Benjamin Moseley, Andrea Vattani, Ravi Kumar, Sergei Vassilvitskii

Publication date (created): 28 March 2012

Abstract

Over half a century old and showing no signs of aging, k-means remains one of the most popular data processing algorithms. As is well known, a proper initialization of k-means is crucial for obtaining a good final solution. The recently proposed k-means++ initialization algorithm achieves this, obtaining an initial set of centers that is provably close to the optimum solution. A major downside of k-means++ is its inherently sequential nature, which limits its applicability to massive data: one must make k passes over the data to find a good initial set of centers. In this work we show how to drastically reduce the number of passes needed to obtain, in parallel, a good initialization. This is unlike prevailing efforts on parallelizing k-means, which have mostly focused on the post-initialization phases of k-means. We prove that our proposed initialization algorithm k-means|| obtains a nearly optimal solution after a logarithmic number of passes, and then show that in practice a constant number of passes suffices. Experimental evaluation on real-world large-scale data demonstrates that k-means|| outperforms k-means++ in both sequential and parallel settings.
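
The contrast drawn in the abstract, k sequential passes versus a handful of parallelizable ones, can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not spell out the steps of k-means||, so the oversampling factor (2k), the fixed round count (5), the final weighted reclustering step, and all function names below are assumptions chosen for illustration.

```python
import random

def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def nearest_sq_dists(points, centers):
    """One pass over the data: squared distance to the closest center."""
    return [min(sq_dist(p, c) for c in centers) for p in points]

def kmeanspp_seed(points, k, rng, weights=None):
    """Sequential D^2 seeding: each new center needs its own pass."""
    if weights is None:
        weights = [1.0] * len(points)
    centers = [rng.choices(points, weights=weights, k=1)[0]]
    while len(centers) < k:
        d2 = nearest_sq_dists(points, centers)
        probs = [w * d for w, d in zip(weights, d2)]
        if sum(probs) == 0:  # fewer than k distinct points remain
            break
        centers.append(rng.choices(points, weights=probs, k=1)[0])
    return centers

def oversample_seed(points, k, rng, oversample=None, rounds=5):
    """Oversampling sketch: few passes, each adding many candidates.

    `oversample` and `rounds` are illustrative defaults, not values
    taken from this page.
    """
    ell = oversample if oversample is not None else 2 * k
    candidates = [rng.choice(points)]
    for _ in range(rounds):
        d2 = nearest_sq_dists(points, candidates)
        cost = sum(d2)
        if cost == 0:
            break
        # Every point is sampled independently of the others, so this
        # step could be split across workers.
        candidates.extend(
            p for p, d in zip(points, d2)
            if rng.random() < min(1.0, ell * d / cost)
        )
    # Weight each candidate by the number of points closest to it,
    # then reduce the small weighted candidate set to k centers.
    weights = [0.0] * len(candidates)
    for p in points:
        j = min(range(len(candidates)),
                key=lambda i: sq_dist(p, candidates[i]))
        weights[j] += 1.0
    return kmeanspp_seed(candidates, k, rng, weights=weights)
```

In this sketch `kmeanspp_seed(points, k, rng)` scans the data once per center, while `oversample_seed(points, k, rng)` scans it `rounds` times regardless of k and then runs the sequential seeding only on the much smaller candidate set.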

Author and article information

arXiv ID: 1203.6402

SO-VID: a75216e5-a81f-4a2d-a09c-4f05a626a2ad

License:

http://arxiv.org/licenses/nonexclusive-distrib/1.0/

Journal reference: Proceedings of the VLDB Endowment (PVLDB), Vol. 5, No. 7, pp. 622-633 (2012)

Comments: VLDB 2012

Categories: cs.DB

Proxy: Ahmet Sacan
