K-means Clustering Algorithm: Efficient Implementation on Graphics Processing Units

Data analysis and classification play an important role in understanding various real-life phenomena. Clustering helps analyze data with little or no prior knowledge about it. K-means clustering is a popular clustering algorithm with applications in computer vision, data mining, and data visualization. Due to continuously increasing data volumes, parallel computing is necessary to overcome the computational challenges involved in K-means clustering. We present the design and implementation of the K-means clustering algorithm on widely available graphics processing units (GPUs), which have the hardware architecture required to meet these parallelism needs. We analyze the scalability of our proposed methods with increase in the number and dimensionality of data points as well as the number of clusters. We also compare our results with the current best available implementations on GPUs and a 24-way threaded parallel CPU implementation. We achieved a consistent speedup of 6.5x over the parallel CPU implementation.

Chapter 1
Introduction

"Understanding our world requires conceptualizing the similarities and differences between the entities that compose it" [32]. To understand real-life phenomena, we must first aggregate the raw data and then perform analysis over it. Various methodologies have been adopted for analyzing raw data, both for supervised and unsupervised learning. Clustering is one such technique: an unsupervised classification or exploratory data analysis that requires no labeled data [35]. It aids in organizing the data instances into an efficient representation that characterizes the data being sampled.
Formally, clustering groups data instances into subsets in such a manner that similar instances are grouped together, while different instances belong to different groups [28]. The goal of clustering is to separate a finite, unlabeled data set into a finite and discrete set of "natural", hidden data structures, rather than to provide an accurate characterization of unobserved samples generated from the same probability distribution [2,6].
There are a large number of clustering algorithms and heuristics that have been developed over the past several decades [13,28,35]. Since these clustering algorithms have to analyze real-world data, they often have to deal with huge datasets that require high processing power. Over the last few years, Graphics Processing Units (GPUs) have emerged as a new alternative for handling operations with high levels of parallelism [16]. Also, since the data operations in clustering algorithms are largely independent and compute-intensive, the large number of cores available on GPUs offers a natural alternative for extracting maximum parallelism and efficiency [4].
In our work, we look at the design and implementation of the K-means clustering algorithm on GPUs using a general-purpose parallel programming model, namely the Compute Unified Device Architecture (CUDA) [24]. We analyze the scalability of our proposed methods with increase in the number and dimensionality of data points as well as the number of clusters. We also compare our results with the current best available implementations on GPUs and a 24-way threaded parallel CPU implementation.
Because of its efficiency in clustering large data sets, K-means is one of the most popular and commonly used partitional algorithms [14]. It has been successfully used for analysis in a variety of fields such as economics (market segmentation), life and medical sciences (genetics, microbiology), engineering (machine learning), astronomy, earth sciences and sociology (behavior pattern discovery) [35].
The GPUMiner suite [9], containing parallel data mining implementations on graphics processors, was introduced subsequently. Its authors reported a 5x improvement over the first work [5] done by the University of Virginia, but it was still slower than the University of Virginia's later implementation [4].
In 2009, a team of researchers at HP Labs further improved these results [34].
They achieved up to 4x speedup over the University of Virginia's implementation [4] and 20x to 70x speedup over GPUMiner [9]. This work was further improved by Li et al. [17] in 2010 by making separate K-means implementations for low-dimensional and higher-dimensional input data sets. In 2011, Wasif et al. [33] reported a 2x improvement over the work of Li et al. [17]. They also reported a 70x speedup on a Fermi-architecture GTX 480 GPU over a 32-bit Intel Core 2 Duo processor.

Outline
The rest of the thesis is organized as follows. In Chapter 2, we give an introduction to the K-means clustering algorithm. We illustrate its different stages and then analyze its complexity and scalability. We also take a look at the NU-MineBench benchmark suite [26].

Chapter 2
K-means Clustering Algorithm

Even though K-means was first proposed over fifty years ago, it is still one of the most widely used algorithms for clustering. Ease of implementation, simplicity, efficiency, and empirical success are the main reasons for its popularity. The main features of the conventional K-means clustering algorithm [12] are as follows.
1. It is a partitional or non-hierarchical clustering method.
2. The total number of clusters, K, is assumed to be fixed and known beforehand.
3. It uses Euclidean distance as the criterion for finding distance between the data points and hence, is only applicable to numerical data.
4. It is a greedy iterative method, and the iterations continue until either the error function falls below a threshold or the membership of the data points no longer changes.

Algorithm
Let D = {x_i, i = 1, ..., n} be the given input data set of d-dimensional data points.
Let K be the total number of disjoint clusters, denoted by C = {c_k, k = 1, ..., K}.
For each cluster c_k, let µ(c_k) be its centroid or mean. Then the error function is defined as

E(C) = \sum_{k=1}^{K} \sum_{x_i \in c_k} \left\| x_i - \mu(c_k) \right\|^2 .

The aim of K-means is to minimize the value of the error function E(C), which is an NP-hard problem [8].
The pseudo code for conventional K-means algorithm is shown in Algorithm 1.

Algorithm 1 K-means Algorithm
Input: Dataset D containing n d-dimensional points, number of clusters K, error threshold E_t.
Output: K disjoint clusters, C = {c_k, k = 1, ..., K}, with their members.

1. Let {c_1, c_2, ..., c_K} be the initial K partitions.
2. repeat
3.    for each data-point x in D do
4.       Let x belong to cluster c_1.
5.       for each cluster c_k, k = 2, ..., K do
6.          Compute the Euclidean distance between x and µ(c_k).
7.          if x is closer to µ(c_k) than to the centroid of its current cluster then
8.             Reassign x to cluster c_k.
9.       end for
10.   end for
11.   Recompute the cluster centroids based on the above assignments.
12.   Calculate the error function E(C).
13. until either there were no changes in cluster membership or E(C) < E_t
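For reference, the following is a minimal serial sketch of one iteration of Algorithm 1 in C++; the row-major array layout and the function name are our own illustration, not the parallel code developed in later chapters.

#include <cfloat>
#include <cstddef>
#include <vector>

// One iteration of Algorithm 1 over n d-dimensional points stored row-major in
// data[n*d]; centroids[K*d] holds the current centroids, membership[n] receives
// the labels. Returns the error E(C) of the new assignment.
double kmeans_iteration(const float* data, float* centroids, int* membership,
                        std::size_t n, int d, int K) {
    // Assignment stage (Steps 3-10): nearest centroid by Euclidean distance.
    double error = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        double best = DBL_MAX;
        int bestK = 0;
        for (int k = 0; k < K; ++k) {
            double dist = 0.0;
            for (int j = 0; j < d; ++j) {
                double diff = data[i * d + j] - centroids[k * d + j];
                dist += diff * diff;
            }
            if (dist < best) { best = dist; bestK = k; }
        }
        membership[i] = bestK;
        error += best;
    }
    // Update stage (Step 11): every centroid becomes the mean of its members.
    std::vector<double> sum(static_cast<std::size_t>(K) * d, 0.0);
    std::vector<std::size_t> count(K, 0);
    for (std::size_t i = 0; i < n; ++i) {
        int k = membership[i];
        ++count[k];
        for (int j = 0; j < d; ++j) sum[k * d + j] += data[i * d + j];
    }
    for (int k = 0; k < K; ++k)
        if (count[k] > 0)
            for (int j = 0; j < d; ++j)
                centroids[k * d + j] = static_cast<float>(sum[k * d + j] / count[k]);
    return error;  // Step 12; the caller repeats until membership is stable or E(C) < E_t.
}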

Analysis
The K-means algorithm terminates at a local optimum. It always converges to a local minimum point [29] after a finite number of iterations, and so the value of the error function E(C) is always non-increasing. The most important choice in the K-means algorithm is the value of K. Also, the final set of clusters and the number of iterations taken to reach them depend on the initially selected centroids (Step 1 in Algorithm 1). Figure 2.1 illustrates the execution of K-means Algorithm 1.
We consider the case where the input data points are two-dimensional and the value of K is 3. Figure 2.1a shows the initialization phase (Step 1) of the algorithm. Three random data-points have been selected as the initial centroids. In the first iteration, the assignment stage (Step 3) assigns the data points to the initial centroids based on their Euclidean distance from the initially selected centroids, as depicted in Figure 2.1. The value of the error function E(C) keeps decreasing with each iteration [29].
Finally, in the third and last iteration, the assignment stage does not produce any reassignments (Figure 2.1f). Since this was our termination condition (Step 13), the algorithm exits and outputs the final clusters with their members.

Finding nearest centroid
For each data point, the assignment stage computes its distance from all K centroids, so this step has a time complexity of O(nKd). For large input data sets, the values of d and K are much smaller than n. As a result, we can consider this step to be linear in terms of n.
The extra space required is for storing the membership of the data points, which is in any case required in the output, and is again O(n). Hence, the time and space complexity grow linearly with the size of the input data set, making it highly scalable.

Computing new centroids
While computing the new centroids, the centroid µ(c_k) is calculated for each of the K clusters.
This requires taking mean of all the members for every cluster. Since these clusters are disjoint, mean calculation of all the K clusters has a time complexity of O(nd).
Again for large data sets this can be considered linear in terms of n.
For storing the computed K centroids it requires O(Kd) extra space. Hence, this step too is linear in terms of n making the complete K-means algorithm linear and very efficient for large sized data sets.

NU-MineBench: A CPU Benchmark
In this section we take a look at the NU-MineBench data mining benchmark suite [26], which provides a multithreaded CPU implementation of K-means. The parallel K-means in NU-MineBench partitions the n data points equally among all T threads. Each thread then executes in parallel: for each data-point p assigned to it, the thread finds the nearest centroid, records the membership of p, and accumulates p's co-ordinates and the member count into its thread-local arrays newCentroid and newClusterSize. After the parallel execution ends, the new centroids are computed (Step 21 onward in the NU-MineBench listing) by reducing, for each cluster k in K and each thread t in T, the thread-local sums over every dimension d_i in d.
Since it is hard to know the number of iterations it might take for K-means to terminate, NU-MineBench takes the maximum number of iterations M_itr as an additional input.
Step 1 of the listing initializes the first K data points as the initial centroids.
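A simplified sketch of this per-thread accumulation scheme is shown below (OpenMP C++; the function name and row-major layout are our illustration, not the actual NU-MineBench source).

#include <omp.h>
#include <vector>
#include <cfloat>

// One parallel assignment pass followed by a single-threaded flat reduction,
// in the spirit of the NU-MineBench scheme described above.
void parallel_kmeans_step(const float* data, float* centroids, int* membership,
                          int n, int d, int K, int T) {
    std::vector<int>   newClusterSize(T * K, 0);        // per-thread member counts
    std::vector<float> newCentroid(T * K * d, 0.0f);    // per-thread coordinate sums

    #pragma omp parallel num_threads(T)
    {
        int t = omp_get_thread_num();
        #pragma omp for
        for (int i = 0; i < n; ++i) {
            float best = FLT_MAX; int bestK = 0;
            for (int k = 0; k < K; ++k) {
                float dist = 0.0f;
                for (int j = 0; j < d; ++j) {
                    float diff = data[i * d + j] - centroids[k * d + j];
                    dist += diff * diff;
                }
                if (dist < best) { best = dist; bestK = k; }
            }
            membership[i] = bestK;
            ++newClusterSize[t * K + bestK];             // thread-local, no races
            for (int j = 0; j < d; ++j)
                newCentroid[(t * K + bestK) * d + j] += data[i * d + j];
        }
    }

    // Flat reduction by a single thread (Step 21 onward in the listing).
    for (int k = 0; k < K; ++k) {
        int size = 0;
        std::vector<float> sum(d, 0.0f);
        for (int t = 0; t < T; ++t) {
            size += newClusterSize[t * K + k];
            for (int j = 0; j < d; ++j)
                sum[j] += newCentroid[(t * K + k) * d + j];
        }
        if (size > 0)
            for (int j = 0; j < d; ++j)
                centroids[k * d + j] = sum[j] / size;
    }
}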

Our Modifications
NU-MineBench's performance relies heavily on cache utilization. To avoid any shared-memory conflicts between the parallel executing threads, the local variables newClusterSize and newCentroid are maintained separately for each thread. Also, while computing the new centroids (Step 21) only a single thread is used, because the overhead of launching multiple threads to perform a tree-based reduction is much higher than that of a flat reduction using a single thread. This is based on the assumption that the total number of available threads is going to be low, keeping the number of values to be reduced small.
But we found that the speedup achieved with NU-MineBench decreases as the value of K is lowered. In fact, for K = 2, we saw an increase in execution time with an increase in thread count; see Figure 2.2a. On investigation we found false sharing to be the limiting factor. While assigning memory for the variable newClusterSize, a contiguous memory region of T * K * sizeof(int) bytes is allocated, where the first K positions are updated by the first thread, the next K by the second thread, and so on. When the value of K is small, it is possible that values belonging to multiple threads reside in the same cache block. This results in multiple threads updating the same cache block, so the participating threads have to refresh their cached copies with the changes made by other threads before every read. To overcome this shortcoming, we ensure that the values stored by each thread occupy distinct cache blocks: we separate every set of K indices inside the array newClusterSize with unused memory locations, and we make similar changes to the newCentroid array. This ensures that even for smaller values of K the speedup obtained is linear in thread count; see Figure 2.2b. Also, since the size of a cache block is generally small (typically 64 bytes), this change does not have any significant impact on the overall memory requirement.
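A minimal sketch of the padded allocation follows; the 64-byte cache-line size and the helper name are assumptions for illustration.

#include <cstdlib>
#include <cstddef>

// Allocate per-thread cluster counters padded out to whole cache lines so that
// no two threads ever update the same cache block (avoiding false sharing).
int* alloc_padded_counters(int T, int K, std::size_t* stride_out) {
    const std::size_t CACHE_LINE = 64;                                  // assumed line size
    std::size_t bytes  = ((K * sizeof(int) + CACHE_LINE - 1) / CACHE_LINE) * CACHE_LINE;
    std::size_t stride = bytes / sizeof(int);                           // padded ints per thread
    *stride_out = stride;
    // Thread t now touches only counters[t * stride + k], k = 0 .. K-1.
    return static_cast<int*>(calloc(T * stride, sizeof(int)));
}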

Chapter 3
CUDA: A General Purpose Parallel Computing Architecture

The first ever commercial graphics processing unit (GPU) was designed by NVIDIA in 1999. Because of the ever-increasing demand for real-time graphics, which requires high arithmetic throughput, manufacturers have always focused on increasing the parallel processing capabilities of GPUs. Today, GPUs outperform CPUs both in arithmetic processing efficiency and memory bandwidth [11,21].
Since 2003, efforts have been made to use GPUs even for non-graphics applications, especially for scientific work, so that their high arithmetic throughput can be fully exploited. In our work we have used NVIDIA's Tesla C1060 and Fermi-based Tesla C2070 GPU cards. For implementing the K-means algorithm on the GPU, we have used the Compute Unified Device Architecture (CUDA) API [24]. This chapter provides a brief introduction to NVIDIA's GPU architecture and the CUDA API.
More detailed information is available in the literature published by NVIDIA [22,23] and in the books by Farber [10] and Kirk [16].

GPU Architecture
For data transfer between the CPU and GPU there is a host interface connecting the GPU with the CPU via PCI-Express. Fermi uses a GigaThread global scheduler to distribute execution blocks between the SMs. Each SM contains unified graphics and computing multiprocessors which execute vertex/geometry shader programs as well as parallel computing programs.
We will concentrate only on the parallel computing capabilities of the SMs.

Memory Hierarchy
Data reuse is an essential requirement on GPUs because of the high latency of DRAM memory. To overcome this challenge, caches are provided both inside the SMs and between the SMs and global memory, to hold regularly accessed data.

Hardware Multithreading
A thread is the smallest execution unit on a GPU. Threads are created, managed, scheduled and executed by the SM's SIMT multithreaded instruction unit in groups of 32 parallel threads called warps. Fermi SMs contain two warp schedulers, allowing two warps to be issued and executed concurrently. A SIMT warp is composed of individual threads of the same type. All the threads start together at the same program address. Whenever the cores assigned to a warp scheduler are ready, the scheduler selects a warp that is ready to execute and issues the next instruction to that warp's active threads. The instruction gets broadcast to all the participating threads.
Due to branching and predication, a thread may be inactive and may not execute.
Whenever threads diverge, all the branches are executed serially. All the threads that do not fall on a particular path are disabled. Once all the paths have been completed the participant threads again re-converge and start executing the next instruction simultaneously.
Serial execution due to branch divergence only happens inside a warp. Different warps can independently execute common or disjoint paths without affecting each other. Since each warp scheduler has 16 cores to execute upon, an issued warp instruction executes as two sets of 16 threads (half-warps) over two processor cycles.
Also, the maximum number of warps that can reside simultaneously on an SM is limited by the amount of registers and shared memory used by each thread block.

CUDA Execution Model
A CUDA program executes with the GPU acting as a co-processor of the CPU. The code starts executing on the CPU; to have a piece of code executed on the GPU, a special call, the kernel invocation, is used. It is an asynchronous call to the CUDA driver, which loads the program onto the GPU while control immediately returns to the CPU to execute the next instruction. All transfers of data, launching of GPU computing functions (kernels) and any other interaction between the CPU and GPU are handled by the runtime driver. To map a large computational problem onto the highly parallel GPU architecture, the problem is divided into smaller independent components that can be solved in parallel.
Before the large number of concurrent threads are scheduled onto the GPU, they are partitioned into disjoint components called thread blocks. All threads in a thread block execute on the same SM and can cooperate with each other through shared memory and barrier synchronization.

CUDA C Runtime
For programming using CUDA, we have used CUDA's C Runtime API. We mention here few basic functions provided by the C runtime. An exhaustive listing is present in CUDA reference manual [24].
• Kernel invocation: A kernel is declared like a normal C function except that it uses the qualifier __global__. It is invoked with the kernel<<<grid_size, block_size>>>(...) syntax, where grid_size specifies the number of blocks in the grid and block_size specifies the number of threads inside every block.
• cudaMemcpy: used to copy data from host to device memory, from device to device memory, or from device to host memory.
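As a concrete sketch of these calls (the kernel below is a trivial placeholder, not part of our K-means code):

#include <cuda_runtime.h>

// A trivial kernel: each thread scales one element of the array.
__global__ void scale(float* data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Host side: copy the input to the device, launch the kernel, copy the result back.
void run_scale(float* host, int n) {
    float* dev = nullptr;
    cudaMalloc(&dev, n * sizeof(float));
    cudaMemcpy(dev, host, n * sizeof(float), cudaMemcpyHostToDevice);
    int block_size = 256;                                // threads per block
    int grid_size  = (n + block_size - 1) / block_size;  // blocks in the grid
    scale<<<grid_size, block_size>>>(dev, 2.0f, n);      // asynchronous kernel launch
    cudaMemcpy(host, dev, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dev);
}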

Chapter 4
Finding Nearest Centroid
In this chapter we will create a parallel implementation for assigning data points to their nearest cluster. As explained in Section 2.3.1, the time complexity of this step is linear in terms of the number of data points. Also, calculation of distance for each data point does not depend on any other data point and so processing for all the data points can continue in parallel.
We will pay special attention to the arrangement of data points and centroids inside GPU memory. Also, because of the limited on-chip memory available in a GPU core, we need to optimize our implementation so that loading of data points into on-chip memory overlaps with distance calculation and the cores are never idle.
We will start with an implementation for the special case where the input data points are low-dimensional. We will look at the various possible storage arrangements and their effect on the processing efficiency. On the basis of our observations for the special case of low-dimensional data points, we will finally create a generic implementation which can efficiently parallelize assignment of data points of any dimension and is fully scalable.
We will continue to use the naming convention introduced in Section 2.3.1 for all remaining chapters.

Special Case: Low-dimensional Input
We will consider the case where the dimension d of the input data is small enough to load one complete data point into the on-chip registers available to a thread. Since the number of registers available per thread is limited (64 on the C2070 and 128 on the C1060), we only consider inputs with values of d up to 22.
Each thread is assigned one data point. It calculates the distance of the data-point from all the K centroids. The index of the nearest centroid is stored in an array of length n, which is the final output.

Storage of data points
The data points are stored in device memory, as it has the maximum capacity. Also, once copied into device memory, the data points never need to be re-arranged, so the cost of copying data from host memory to device memory is a one-time cost.
While loading the data points from device memory, threads belonging to the same warp load the same dimension of successive data points. To ensure that each load by a warp completes in a single transaction, the data points are arranged dimension-major, as [d][n]. Still, if the value of n is not a multiple of the transaction width of device memory, a single read by a warp might require two transactions. To avoid this, we use the cudaMallocPitch function to allocate memory for the data points with padding, ensuring that the first value of each dimension starts at a transaction boundary. cudaMemcpy2D is used to copy the data from the host into this padded memory.
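A sketch of this padded, dimension-major copy (the helper name is ours; error checking omitted):

#include <cuda_runtime.h>

// Copy n d-dimensional points, stored dimension-major on the host as a d x n
// row-major array, into pitched device memory so that every row (one dimension
// of all n points) starts at a transaction boundary.
float* copy_points_pitched(const float* h_points, int n, int d, size_t* pitch) {
    float* d_points = nullptr;
    cudaMallocPitch((void**)&d_points, pitch, n * sizeof(float), d);
    cudaMemcpy2D(d_points, *pitch,                 // destination and its pitch
                 h_points, n * sizeof(float),      // source and its row width
                 n * sizeof(float), d,             // width in bytes, number of rows
                 cudaMemcpyHostToDevice);
    // In a kernel, dimension j of point i is then read (coalesced across a warp) as
    //   *((const float*)((const char*)d_points + j * pitch) + i);
    return d_points;
}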

Loading of data points
Since a data-point can be completely loaded inside a thread, all the threads load their data points only once at the beginning of the kernel. While declaring the register array inside a thread, we cannot use a variable d to specify the array length, so the dimension is supplied as a compile-time template parameter of the kernel.

Loading of centroids
All the threads need to access the same K centroids. This gives an excellent opportunity for reusing the centroid values loaded from device memory. This can be done by loading the centroids in shared memory. Also, all the threads inside a warp access the same dimension of a single centroid. Due to this broadcast access, constant memory can also be used for storing the centroid. We compare both the approaches.

Centroid in shared memory
We consider the optimal case where the values of K and d are such that all the centroids can be accommodated in shared memory at once. After loading their data points, all the threads load the centroids into shared memory.
Threads in the same block wait till all the threads finish loading the centroids to ensure that all centroid values have been loaded into shared memory. Once centroids are loaded no more device memory accesses are required and each thread computes the distance from all the centroids by using data point present in its on-chip registers and centroids present in shared memory. Finally, the index of the closest centroid is stored in the membership array in a coalesced manner.
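A sketch of such a kernel is shown below; D is the compile-time dimension, and the kernel is launched with K*D*sizeof(float) bytes of dynamic shared memory, for example assign_shared<22><<<grid, 256, K*22*sizeof(float)>>>(...). Names and details are illustrative rather than the exact kernel we benchmark.

#include <cfloat>

// Each thread handles one data point kept in registers; the block cooperatively
// stages all K centroids (stored as [K][D]) in shared memory before computing.
template <int D>
__global__ void assign_shared(const float* points, size_t pitch,
                              const float* centroids, int* membership,
                              int n, int K) {
    extern __shared__ float s_cent[];                    // K * D floats
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    float p[D];                                          // one-time load of the point
    if (i < n)
        for (int j = 0; j < D; ++j)
            p[j] = *((const float*)((const char*)points + j * pitch) + i);

    for (int t = threadIdx.x; t < K * D; t += blockDim.x)
        s_cent[t] = centroids[t];
    __syncthreads();                                     // wait until centroids are loaded

    if (i >= n) return;
    float best = FLT_MAX;
    int bestK = 0;
    for (int k = 0; k < K; ++k) {
        float dist = 0.0f;
        for (int j = 0; j < D; ++j) {
            float diff = p[j] - s_cent[k * D + j];
            dist += diff * diff;
        }
        if (dist < best) { best = dist; bestK = k; }
    }
    membership[i] = bestK;                               // coalesced write
}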

Centroid in constant memory
In the case of access from constant memory, once the data point has been loaded by a thread, it can directly start accessing the centroid values for distance computation.
Also, since each thread first computes distance from one centroid before moving onto the next centroid, we store centroids as [K][d] to ensure spatial locality inside constant memory cache.
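The constant-memory variant only changes where the centroids live. A sketch of the host-side staging follows; the array name and the 16K-float bound (the 64 KB constant-memory limit) are our assumptions.

#include <cuda_runtime.h>

__constant__ float c_cent[16384];          // centroids stored as [K][d], at most 64 KB

// Host side: stage the current centroids before each assignment pass.
void upload_centroids(const float* h_centroids, int K, int d) {
    cudaMemcpyToSymbol(c_cent, h_centroids, K * d * sizeof(float));
}
// Inside the kernel, the distance loop simply reads c_cent[k * d + j]; since all
// threads of a warp touch the same element, the access is a constant-cache broadcast.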

Centroid in device memory
For the case of Fermi, data in device memory is also cached (until written to) in on-chip L1 cache. So for C2070, we also try accessing centroid directly from device memory instead of loading it into shared memory. In this case also, after loading of data points each thread can directly start computing the distance from centroids.
We set the size of the L1 cache to 48 KB so that more space is available for caching the centroids.

On the C1060, constant memory gives much better performance than shared memory. It even outperforms the case where there is no loading delay for shared memory.

Analysis
This clearly shows that accesses from the constant cache are much faster than those from shared memory. Our benchmarking experiments also showed that the throughput achieved from shared memory accesses is three-fourths of the throughput achievable by register accesses.
For the last row of the corresponding table, with constant memory we are able to achieve almost 89% of the C1060's peak throughput.
On the C2070, the L1 cache used for caching accesses from global memory is not able to perform as well as the cache provided by constant memory. Here also, when the centroids are in shared memory, loading the centroids from device memory causes very little overhead, as can be seen by comparing the second-last and third-last columns of the corresponding table. Constant memory again performs better than shared memory, but the advantage is much smaller than on the C1060. Three single-precision floating-point operations every two cycles give a maximum throughput of around 760 GFlop/s on the C2070.
We are able to achieve around 75% of it. One major issue with achieving high throughput on Fermi-based GPUs is that, due to the presence of two warp schedulers, a higher number of warps is needed (in comparison to the C1060) to keep both schedulers busy and hide the latency.
While we use 256 threads per block for constant memory kernel, we need as many as 768 threads per block to achieve the maximum throughput for shared memory.
Since with constant memory we can hide the latency with a smaller number of threads, we choose constant memory for storing the centroids in our generic implementation.

Generic Implementation
Based on our observations from the last section, we create a generic implementation for label assignment. Once again, each thread is responsible for only one data point and, for the same reasons as mentioned in the previous section, the centroids are kept in constant memory. Keeping the centroids in constant memory ensures that each thread works completely independently of other threads. Also, the on-chip constant memory cache is separate from the L1 cache present on Fermi, so we also avoid thrashing the L1 cache.
Finally, as observed in the last section, fewer threads are needed with constant memory reads than with shared memory to achieve high throughput. This gives us the freedom to use more registers per thread to hide the latencies of global memory reads.

Loading of data points
Unlike the low-dimensional case, we cannot assume that we will always have enough registers to hold a data point completely. As a result, a thread can only store some of the dimensions of its data point at any given time, and there is no longer a single one-time load of data points from device memory.
Since we keep loading dimensions of the data points and calculating the distance simultaneously, we need to ensure that for each dimension loaded there are enough compute instructions to cover the latency of the read of the next dimension. So we load only a single dimension at a time and update the distance from the maximum possible number of centroids while the next dimension loads from device memory.

Loading of centroids
While previously each thread calculated the distance from one centroid before loading the next centroid, here the thread has only one dimension of its data point loaded in its registers, so it loads the same dimension for some k centroids and updates its distances to them.
Although we still have broadcast access to the centroids, the first k accesses load the same dimension of k successive centroids. So we store the centroids in constant memory as [d][K] to achieve higher spatial locality. The value k depends on the latency of global reads. We keep it as a template variable this time so that we can use it as the length of our distance array.
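A sketch of the resulting inner loop follows; UNROLL plays the role of the template parameter k, the centroids are stored as [d][K] in constant memory, and for simplicity the sketch assumes K is a multiple of UNROLL.

#include <cfloat>

__constant__ float c_cent[16384];                 // centroids, here stored as [d][K]

// Each thread owns one data point; it streams the point's dimensions one at a
// time from device memory and keeps UNROLL running distances in registers.
template <int UNROLL>
__global__ void assign_generic(const float* points, size_t pitch,
                               int* membership, int n, int d, int K) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float best = FLT_MAX;
    int bestK = 0;
    for (int k0 = 0; k0 < K; k0 += UNROLL) {      // next batch of UNROLL centroids
        float dist[UNROLL];
        for (int u = 0; u < UNROLL; ++u) dist[u] = 0.0f;
        for (int j = 0; j < d; ++j) {
            // Coalesced load of one dimension; the distance updates below give the
            // hardware work to overlap with the load of the next dimension.
            float x = *((const float*)((const char*)points + j * pitch) + i);
            for (int u = 0; u < UNROLL; ++u) {
                float diff = x - c_cent[j * K + k0 + u];
                dist[u] += diff * diff;
            }
        }
        for (int u = 0; u < UNROLL; ++u)
            if (dist[u] < best) { best = dist[u]; bestK = k0 + u; }
    }
    membership[i] = bestK;
}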

Scalability
The above implementation does not assume that any of the input parameters fall within a particular range. The data points are processed by each thread independently of all other threads, making it scalable to any value of n. If the value of n is so high that all the data points cannot be accommodated in device memory at once, we can run the kernel multiple times, with device memory containing the data points to be processed in the current invocation.

Chapter 5
Compute New Centroids

Since there are no compute operations to hide the latency of data reads from device memory, we try to ensure that all memory reads are fully coalesced. Also, to avoid repeated writes to device memory (for updating the sums with each and every data point), we go through all the data points, keep reducing the coordinate sums inside on-chip memory, and only store them in device memory once all the points have been processed. We create an intra-block reduction kernel to achieve this.
Reduction of co-ordinates for every cluster can be performed in parallel with the other clusters. Each CUDA block reduces the co-ordinates for a particular cluster, k. Also, the total number of blocks launched is a multiple of the total number of clusters K.
Data points are divided equally among all the blocks reducing for the same cluster.
After all the blocks have finished execution, for every cluster k, we reduce the sums of co-ordinates of all the blocks that were invoked for it to get the final reduced sum for k. The count of members is also reduced along with the co-ordinates inside the blocks and so we finally get the new centroid by taking mean of the reduced sum of each dimension. This step is performed in a separate inter-block reduction kernel.

Intra-block Reduction
Each block knows the cluster k it is reducing for. Also, it knows the range of data points it has to reduce. For each data-point assigned to k, it adds the co-ordinates in its on-chip memory and increments count. Once all the points have been processed, the reduced sums and count are copied into device memory.

Reduce member co-ordinates
This is the most crucial step as it involves maximum device memory reads. To ensure that co-ordinates of every member data point are read from device memory in a coalesced manner, separate reduction is performed by each warp inside the block.
Data points to be processed by each block are partitioned equally among all the warps. Each warp processes its assigned data points without any dependence on threads from other warps as shown in Figure 5
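A simplified sketch of the intra-block kernel is given below: the grid has K*blocksPerCluster blocks, blockDim.x is a power of two, and blockDim.x*sizeof(float) bytes of dynamic shared memory are used. The per-warp partitioning and the ballot-based membership test of our actual kernel are omitted here.

// Block b of cluster k scans its slice of the points, accumulates the member
// count and the per-dimension coordinate sums of k, and writes one partial
// result per block; a shared-memory tree reduction combines the threads.
__global__ void intra_block_reduce(const float* points, size_t pitch,
                                   const int* membership, int n, int d,
                                   int blocksPerCluster,
                                   float* partialSums,    // [gridDim.x][d]
                                   int*   partialCounts)  // [gridDim.x]
{
    extern __shared__ float s_red[];                      // blockDim.x floats
    int cluster  = blockIdx.x / blocksPerCluster;         // cluster served by this block
    int slice    = blockIdx.x % blocksPerCluster;
    int perBlock = (n + blocksPerCluster - 1) / blocksPerCluster;
    int lo = slice * perBlock;
    int hi = min(lo + perBlock, n);

    // Member count (stored as float; exact for counts below 2^24).
    int myCount = 0;
    for (int i = lo + threadIdx.x; i < hi; i += blockDim.x)
        myCount += (membership[i] == cluster);
    s_red[threadIdx.x] = (float)myCount;
    __syncthreads();
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s) s_red[threadIdx.x] += s_red[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0) partialCounts[blockIdx.x] = (int)s_red[0];

    // Coordinate sums, one dimension at a time with coalesced reads.
    for (int j = 0; j < d; ++j) {
        const float* row = (const float*)((const char*)points + j * pitch);
        float mySum = 0.0f;
        for (int i = lo + threadIdx.x; i < hi; i += blockDim.x)
            if (membership[i] == cluster) mySum += row[i];
        __syncthreads();                                  // previous use of s_red is done
        s_red[threadIdx.x] = mySum;
        __syncthreads();
        for (int s = blockDim.x / 2; s > 0; s >>= 1) {
            if (threadIdx.x < s) s_red[threadIdx.x] += s_red[threadIdx.x + s];
            __syncthreads();
        }
        if (threadIdx.x == 0) partialSums[blockIdx.x * d + j] = s_red[0];
    }
}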

Inter-block Reduction
After the completion of intra-block reduction, a separate kernel is invoked to reduce the values of sum and count that were stored for each cluster in Section 5.1.3. One kernel block is created for each cluster and so the total number of blocks is equal to the number of clusters. Each kernel block reduces the values stored by all the blocks in Section 5.1.3 for the cluster k assigned to it. After reducing the sum and count values for cluster k, it finds the mean for every dimension. These means are the final co-ordinates of the centroid.
Finally, each block stores the obtained co-ordinates of its centroid k, inside device memory.
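A matching sketch of the inter-block kernel, launched with one block per cluster (again simplified relative to our actual kernel):

// Block k reduces the per-block partial sums and counts written above for
// cluster k, then divides to obtain the new centroid's coordinates.
__global__ void inter_block_reduce(const float* partialSums,   // [K*blocksPerCluster][d]
                                   const int*   partialCounts, // [K*blocksPerCluster]
                                   int blocksPerCluster, int d,
                                   float* centroids)           // [K][d]
{
    int k = blockIdx.x;                                  // cluster handled by this block
    __shared__ int count;
    if (threadIdx.x == 0) {
        count = 0;
        for (int b = 0; b < blocksPerCluster; ++b)
            count += partialCounts[k * blocksPerCluster + b];
    }
    __syncthreads();
    // The d dimensions are split among the threads of the block.
    for (int j = threadIdx.x; j < d; j += blockDim.x) {
        float sum = 0.0f;
        for (int b = 0; b < blocksPerCluster; ++b)
            sum += partialSums[(k * blocksPerCluster + b) * d + j];
        if (count > 0) centroids[k * d + j] = sum / count;
    }
}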

Chapter 6
Experimental Results

Experimental Setup
The CUDA kernels are executed on NVIDIA's C1060 and C2070 GPUs. For our CPU executions, we use a quad-processor SMP Xeon server. The four processors are hex-core Intel X7460 CPUs running at 2.66 GHz.

Comparison with CPU
We compare our GPU implementation with our modified NU-MineBench implementation (described in Section 2.4.2) running on our 24-core Xeon server. We use two real-world datasets as input for our comparisons.
Our first input dataset is KDD Cup 1999 [15]. It contains 4,898,431 data points, each having 41 attributes. Figures 6.1a and 6.1b show the speedup obtained with change in number of data-points and clusters respectively.
Next, we perform the comparison for the Covertype Data Set [7]. It contains 581,012 data points, each having 54 attributes. Figures 6.2a and 6.2b show the speedup obtained with change in the number of data-points and clusters respectively.
We note that although the C1060 has 240 cores, the frequency of the CPU cores is twice that of the GPU cores; for Fermi it is in fact 2.5 times. Also, during the data-load-intensive reduction operation, the CPU has larger caches: 36 MB of L2 and 64 MB of L3 in total. In comparison, the C1060 does not have any L2 cache, causing much larger latency for load operations during reduction.
On the Fermi-based C2070 there is a 768 KB L2 cache. It enables the C2070 to achieve better performance during reduction in comparison with the C1060. Also, the availability of the ballot function ensures that the threads in a warp do not have to use shared memory for checking the membership of data points during reduction. This is why the C2070 is consistently able to achieve a better speedup as compared to the C1060.

Comparison with Other GPU Implementations
We also compare our implementation with those of Li et al. [17], the University of Virginia [4] and GPUMiner [9]. The execution times are taken from [17] for a sample dataset provided by KDD Cup 1999 [15]. The sample contains 51,200 data points, each having 34 attributes. Our implementation is able to perform better than the published results of Li et al. [17].
We also compare against [17] for varying dimensionality: the value of n is 51,200, K is 32, and the execution time is taken for 100 iterations of K-means. Our K-means consistently performs better than [17]; even when d = 160, we are able to achieve over 1.5x speedup.

Chapter 7
Conclusions and Future Work

The main aim of this study was to achieve high throughput for the K-means algorithm on NVIDIA's GPUs. We were able to achieve around 89% of the peak throughput on the Tesla C1060 GPU. On the Fermi-based Tesla C2070 GPU we were able to achieve 75% of the peak throughput for the K-means operation. We have also been able to achieve better efficiency than the currently available GPU implementations of the basic K-means algorithm. Our implementation is scalable and does not place any limiting conditions on the input sets it can process. One immediate extension of this work would be to use the generic implementation across multiple GPUs. This would further reduce the execution time, especially during label assignment.
There have been many extensions of the original K-means algorithm. In our study, we have implemented the basic K-means algorithm. It would be interesting to implement extensions of K-means algorithm using our generic implementation as the base. There are extensions which decrease the number of reduction operations required for computing new centroids. Such extensions might be able to achieve even higher efficiency on GPUs.